[v2,5/7] net/mlx5: e-switch VXLAN tunnel devices management

Message ID 1539612815-47199-6-git-send-email-viacheslavo@mellanox.com (mailing list archive)
State Superseded, archived
Delegated to: Shahaf Shuler
Headers
Series net/mlx5: e-switch VXLAN encap/decap hardware offload |

Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/checkpatch warning coding style issues

Commit Message

Slava Ovsiienko Oct. 15, 2018, 2:13 p.m. UTC
  VXLAN interfaces are dynamically created for each local UDP port
of outer networks and then used as targets for TC "flower" filters
in order to perform encapsulation. These VXLAN interfaces are
system-wide, the only one device with given UDP port can exist
in the system (the attempt of creating another device with the
same UDP local port returns EEXIST), so PMD should support the
shared device instances database for PMD instances. These VXLAN
implicitly created devices are called VTEPs (Virtual Tunnel
End Points).

Creation of the VTEP occurs at the moment of rule applying. The
link is set up, root ingress qdisc is also initialized.

Encapsulation VTEPs are created on per port basis, the single
VTEP is attached to the outer interface and is shared for all
encapsulation rules on this interface. The source UDP port is
automatically selected in range 30000-60000.

For decapsulaton one VTEP is created per every unique UDP
local port to accept tunnel traffic. The name of created
VTEP consists of prefix "vmlx_" and the number of UDP port in
decimal digits without leading zeros (vmlx_4789). The VTEP
can be preliminary created in the system before the launching
application, it allows to share	UDP ports between primary
and secondary processes.

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 503 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 499 insertions(+), 4 deletions(-)
  

Comments

Yongseok Koh Oct. 25, 2018, 12:28 a.m. UTC | #1
On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> VXLAN interfaces are dynamically created for each local UDP port
> of outer networks and then used as targets for TC "flower" filters
> in order to perform encapsulation. These VXLAN interfaces are
> system-wide, the only one device with given UDP port can exist
> in the system (the attempt of creating another device with the
> same UDP local port returns EEXIST), so PMD should support the
> shared device instances database for PMD instances. These VXLAN
> implicitly created devices are called VTEPs (Virtual Tunnel
> End Points).
> 
> Creation of the VTEP occurs at the moment of rule applying. The
> link is set up, root ingress qdisc is also initialized.
> 
> Encapsulation VTEPs are created on per port basis, the single
> VTEP is attached to the outer interface and is shared for all
> encapsulation rules on this interface. The source UDP port is
> automatically selected in range 30000-60000.
> 
> For decapsulaton one VTEP is created per every unique UDP
> local port to accept tunnel traffic. The name of created
> VTEP consists of prefix "vmlx_" and the number of UDP port in
> decimal digits without leading zeros (vmlx_4789). The VTEP
> can be preliminary created in the system before the launching
> application, it allows to share	UDP ports between primary
> and secondary processes.
> 
> Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
>  drivers/net/mlx5/mlx5_flow_tcf.c | 503 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 499 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
> index d6840d5..efa9c3b 100644
> --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> @@ -3443,6 +3443,432 @@ struct pedit_parser {
>  	return -err;
>  }
>  
> +/* VTEP device list is shared between PMD port instances. */
> +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> +			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> +static pthread_mutex_t vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;

What's the reason for choosing pthread_mutex instead of rte_*_lock?

> +
> +/**
> + * Deletes VTEP network device.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] vtep
> + *   Object represinting the network device to delete. Memory
> + *   allocated for this object is freed by routine.
> + */
> +static void
> +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf,
> +		      struct mlx5_flow_tcf_vtep *vtep)
> +{
> +	struct nlmsghdr *nlh;
> +	struct ifinfomsg *ifm;
> +	alignas(struct nlmsghdr)
> +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> +	int ret;
> +
> +	assert(!vtep->refcnt);
> +	if (vtep->created && vtep->ifindex) {

First of all vtep->created seems of no use. It is introduced to select the error
message in flow_tcf_create_iface(). I don't see any necessity to distinguish
between 'vtep is allocated by rte_malloc()' and 'vtep is created in kernel'.

And why do you need to check vtep->ifindex as well? If vtep is created in kernel
and its ifindex isn't set, that should be an error which had to be hanled in
flow_tcf_create_iface(). Such a vtep shouldn't exist.

Also, the refcnt management is a bit strange. Please put an abstraction by
adding create_iface(), get_iface() and release_iface(). In the get_ifce(),
vtep->refcnt should be incremented. And in the release_iface(), it decrease the
refcnt and if it reaches to zero, the iface can be removed. create_iface() will
set the refcnt to 1. And if you refer to mlx5_hrxq_get(), it even does searching
the list not by repeating the same lookup code here and there. That will make
your code much simpler.

> +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> +		nlh = mnl_nlmsg_put_header(buf);
> +		nlh->nlmsg_type = RTM_DELLINK;
> +		nlh->nlmsg_flags = NLM_F_REQUEST;
> +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> +		ifm->ifi_family = AF_UNSPEC;
> +		ifm->ifi_index = vtep->ifindex;
> +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +		if (ret)
> +			DRV_LOG(WARNING, "netlink: error deleting VXLAN "
> +					 "encap/decap ifindex %u",
> +					 ifm->ifi_index);
> +	}
> +	rte_free(vtep);
> +}
> +
> +/**
> + * Creates VTEP network device.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifouter
> + *   Outer interface to attach new-created VXLAN device
> + *   If zero the VXLAN device will not be attached to any device.
> + * @param[in] port
> + *   UDP port of created VTEP device.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + * Pointer to created device structure on success, NULL otherwise
> + * and rte_errno is set.
> + */
> +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA

Why negative(ifndef) first intead of positive(ifdef)?

> +static struct mlx5_flow_tcf_vtep*
> +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
> +		      unsigned int ifouter __rte_unused,
> +		      uint16_t port __rte_unused,
> +		      struct rte_flow_error *error)
> +{
> +	rte_flow_error_set(error, ENOTSUP,
> +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +			 "netlink: failed to create VTEP, "
> +			 "VXLAN metadat is not supported by kernel");

Typo.

> +	return NULL;
> +}
> +#else
> +static struct mlx5_flow_tcf_vtep*
> +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,

How about adding 'vtep'? It sounds vague - creating a general interface.
E.g., flow_tcf_create_vtep_iface()?

> +		      unsigned int ifouter,
> +		      uint16_t port, struct rte_flow_error *error)
> +{
> +	struct mlx5_flow_tcf_vtep *vtep;
> +	struct nlmsghdr *nlh;
> +	struct ifinfomsg *ifm;
> +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> +	alignas(struct nlmsghdr)
> +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +

Use a macro for '128'. Can't know the meaning.

> +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> +		       SZ_NLATTR_NEST * 2 +
> +		       SZ_NLATTR_STRZ_OF("vxlan") +
> +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> +	struct nlattr *na_info;
> +	struct nlattr *na_vxlan;
> +	rte_be16_t vxlan_port = RTE_BE16(port);

Use rte_cpu_to_be_*() instead.

> +	int ret;
> +
> +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> +			alignof(struct mlx5_flow_tcf_vtep));
> +	if (!vtep) {
> +		rte_flow_error_set
> +			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			 NULL, "unadble to allocate memory for VTEP desc");
> +		return NULL;
> +	}
> +	*vtep = (struct mlx5_flow_tcf_vtep){
> +			.refcnt = 0,
> +			.port = port,
> +			.created = 0,
> +			.ifouter = 0,
> +			.ifindex = 0,
> +			.local = LIST_HEAD_INITIALIZER(),
> +			.neigh = LIST_HEAD_INITIALIZER(),
> +	};
> +	memset(buf, 0, sizeof(buf));
> +	nlh = mnl_nlmsg_put_header(buf);
> +	nlh->nlmsg_type = RTM_NEWLINK;
> +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  | NLM_F_EXCL;
> +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> +	ifm->ifi_family = AF_UNSPEC;
> +	ifm->ifi_type = 0;
> +	ifm->ifi_index = 0;
> +	ifm->ifi_flags = IFF_UP;
> +	ifm->ifi_change = 0xffffffff;
> +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX, port);
> +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> +	assert(na_info);
> +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> +	if (ifouter)
> +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> +	assert(na_vxlan);
> +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> +	mnl_attr_nest_end(nlh, na_vxlan);
> +	mnl_attr_nest_end(nlh, na_info);
> +	assert(sizeof(buf) >= nlh->nlmsg_len);
> +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +	if (ret)
> +		DRV_LOG(WARNING,
> +			"netlink: VTEP %s create failure (%d)",
> +			name, rte_errno);
> +	else
> +		vtep->created = 1;

Flow of code here isn't smooth, thus could be error-prone. Most of all, I don't
like ret has multiple meanings. ret should be return value but you are using it
to store ifindex.

> +	if (ret && ifouter)
> +		ret = 0;
> +	else
> +		ret = if_nametoindex(name);

If vtep isn't created and ifouter is set, then skip init below, which
means, if vtep is created or ifouter is set, it tries to get ifindex of vtep.
But why do you want to try to call this API even if it failed to create vtep?
Let's not make code flow convoluted even though it logically works. Let's make
it straightforward.

> +	if (ret) {
> +		vtep->ifindex = ret;
> +		vtep->ifouter = ifouter;
> +		memset(buf, 0, sizeof(buf));
> +		nlh = mnl_nlmsg_put_header(buf);
> +		nlh->nlmsg_type = RTM_NEWLINK;
> +		nlh->nlmsg_flags = NLM_F_REQUEST;
> +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> +		ifm->ifi_family = AF_UNSPEC;
> +		ifm->ifi_type = 0;
> +		ifm->ifi_index = vtep->ifindex;
> +		ifm->ifi_flags = IFF_UP;
> +		ifm->ifi_change = IFF_UP;
> +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +		if (ret) {
> +			DRV_LOG(WARNING,
> +				"netlink: VTEP %s set link up failure (%d)",
> +				name, rte_errno);
> +			rte_free(vtep);
> +			rte_flow_error_set
> +				(error, -errno,
> +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +				 "netlink: failed to set VTEP link up");
> +			vtep = NULL;
> +		} else {
> +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> +			if (ret)
> +				DRV_LOG(WARNING,
> +				"VTEP %s init failure (%d)", name, rte_errno);
> +		}
> +	} else {
> +		DRV_LOG(WARNING,
> +			"VTEP %s failed to get index (%d)", name, errno);
> +		rte_flow_error_set
> +			(error, -errno,
> +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +			 !vtep->created ? "netlink: failed to create VTEP" :
> +			 "netlink: failed to retrieve VTEP ifindex");
> +			 ret = 1;

If it fails to create a vtep above, it will print out two warning messages and
one rte_flow_error message. And it even selects message to print between two?
And there's another info msg at the end even in case of failure. Do you really
want to do this even with manipulating ret to change code path?  Not a good
practice.

Usually, code path should be straightforward for sucessful path and for
errors/failures, return immediately or use 'goto' if there's need for cleanup.

Please refactor entire function.

> +	}
> +	if (ret) {
> +		flow_tcf_delete_iface(tcf, vtep);
> +		vtep = NULL;
> +	}
> +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" : "error");
> +	return vtep;
> +}
> +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> +
> +/**
> + * Create target interface index for VXLAN tunneling decapsulation.
> + * In order to share the UDP port within the other interfaces the
> + * VXLAN device created as not attached to any interface (if created).
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] dev_flow
> + *   Flow tcf object with tunnel structure pointer set.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + * @return
> + *   Interface index on success, zero otherwise and rte_errno is set.

Return negative errno in case of failure like others.

 *   Interface index on success, a negative errno value otherwise and rte_errno is set.

> + */
> +static unsigned int
> +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> +			   struct mlx5_flow *dev_flow,
> +			   struct rte_flow_error *error)
> +{
> +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> +
> +	vtep = NULL;
> +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> +		if (vlst->port == port) {
> +			vtep = vlst;
> +			break;
> +		}
> +	}

You just need one variable.

	struct mlx5_flow_tcf_vtep *vtep;

	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
		if (vtep->port == port)
			break;
	}

> +	if (!vtep) {
> +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> +		if (vtep)
> +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> +	} else {
> +		if (vtep->ifouter) {
> +			rte_flow_error_set(error, -errno,
> +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +				"Failed to create decap VTEP, attached "
> +				"device with the same UDP port exists");
> +				vtep = NULL;

Making vtep null to skip the following code? Please merge the two same if/else
and make the code path strightforward. And which errno do you expect here?
Should it be set EEXIST instead?

> +		}
> +	}
> +	if (vtep) {
> +		vtep->refcnt++;
> +		assert(vtep->ifindex);
> +		return vtep->ifindex;
> +	} else {
> +		return 0;
> +	}

Why repeating same if/else?


This is my suggestion but if you take my suggestion to have
flow_tcf_[create|get|release]_iface(), this will get much simpler.

{
	struct mlx5_flow_tcf_vtep *vtep;
	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;

	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
		if (vtep->port == port)
			break;
	}
	if (vtep && vtep->ifouter)
		return rte_flow_error_set(... EEXIST ...);
	else if (vtep) {
		++vtep->refcnt;
	} else {
		vtep = flow_tcf_create_iface(tcf, 0, port, error);
		if (!vtep)
			return rte_flow_error_set(...);
		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
	}
	assert(vtep->ifindex);
	return vtep->ifindex;
}


> +}
> +
> +/**
> + * Creates target interface index for VXLAN tunneling encapsulation.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifouter
> + *   Network interface index to attach VXLAN encap device to.
> + * @param[in] dev_flow
> + *   Flow tcf object with tunnel structure pointer set.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + * @return
> + *   Interface index on success, zero otherwise and rte_errno is set.
> + */
> +static unsigned int
> +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> +			    unsigned int ifouter,
> +			    struct mlx5_flow *dev_flow __rte_unused,
> +			    struct rte_flow_error *error)
> +{
> +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> +
> +	assert(ifouter);
> +	/* Look whether the attached VTEP for encap is created. */
> +	vtep = NULL;
> +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> +		if (vlst->ifouter == ifouter) {
> +			vtep = vlst;
> +			break;
> +		}
> +	}

Same here.

> +	if (!vtep) {
> +		uint16_t pcnt;
> +
> +		/* Not found, we should create the new attached VTEP. */
> +/*
> + * TODO: not implemented yet
> + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);
> + */

Personal note is not appropriate even though it is removed in the following
patch.

> +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> +				     - MLX5_VXLAN_PORT_RANGE_MIN); pcnt++) {
> +			encap_port++;
> +			/* Wraparound the UDP port index. */
> +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN ||
> +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> +				encap_port = MLX5_VXLAN_PORT_RANGE_MIN;
> +			/* Check whether UDP port is in already in use. */
> +			vtep = NULL;
> +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> +				if (vlst->port == encap_port) {
> +					vtep = vlst;
> +					break;
> +				}
> +			}

If you want to find out an empty port number, you can use rte_bitmap instead of
repeating searching the entire list for all possible port numbers.

> +			if (vtep) {
> +				vtep = NULL;
> +				continue;
> +			}
> +			vtep = flow_tcf_create_iface(tcf, ifouter,
> +						     encap_port, error);
> +			if (vtep) {
> +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> +				break;
> +			}
> +			if (rte_errno != EEXIST)
> +				break;
> +		}
> +	}
> +	if (!vtep)
> +		return 0;
> +	vtep->refcnt++;
> +	assert(vtep->ifindex);
> +	return vtep->ifindex;

Please refactor this func according to what I suggested for
flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().

> +}
> +
> +/**
> + * Creates target interface index for tunneling of any type.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifouter
> + *   Network interface index to attach VXLAN encap device to.
> + * @param[in] dev_flow
> + *   Flow tcf object with tunnel structure pointer set.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + * @return
> + *   Interface index on success, zero otherwise and rte_errno is set.

 *   Interface index on success, a negative errno value otherwise and
 *   rte_errno is set.

> + */
> +static unsigned int
> +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> +			    unsigned int ifouter,
> +			    struct mlx5_flow *dev_flow,
> +			    struct rte_flow_error *error)
> +{
> +	unsigned int ret;
> +
> +	assert(dev_flow->tcf.tunnel);
> +	pthread_mutex_lock(&vtep_list_mutex);
> +	switch (dev_flow->tcf.tunnel->type) {
> +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> +						 dev_flow, error);
> +		break;
> +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> +		break;
> +	default:
> +		rte_flow_error_set(error, ENOTSUP,
> +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +				"unsupported tunnel type");
> +		ret = 0;
> +		break;
> +	}
> +	pthread_mutex_unlock(&vtep_list_mutex);
> +	return ret;
> +}
> +
> +/**
> + * Deletes tunneling interface by UDP port.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifindex
> + *   Network interface index of VXLAN device.
> + * @param[in] dev_flow
> + *   Flow tcf object with tunnel structure pointer set.
> + */
> +static void
> +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> +			    unsigned int ifindex,
> +			    struct mlx5_flow *dev_flow)
> +{
> +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> +
> +	assert(dev_flow->tcf.tunnel);
> +	pthread_mutex_lock(&vtep_list_mutex);
> +	vtep = NULL;
> +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> +		if (vlst->ifindex == ifindex) {
> +			vtep = vlst;
> +			break;
> +		}
> +	}

It is weird. You just can have vtep pointer in the dev_flow->tcf.tunnel instead
of ifindex_tun which is same as vtep->ifindex like the assertion below. Then,
this lookup can be skipped.

> +	if (!vtep) {
> +		DRV_LOG(WARNING, "No VTEP device found in the list");
> +		goto exit;
> +	}
> +	switch (dev_flow->tcf.tunnel->type) {
> +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> +		break;
> +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> +/*
> + * TODO: Remove the encap ancillary rules first.
> + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
> + */

Is it a personal note? Please remove.

> +		break;
> +	default:
> +		assert(false);
> +		DRV_LOG(WARNING, "Unsupported tunnel type");
> +		break;
> +	}
> +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> +	assert(vtep->refcnt);
> +	if (!vtep->refcnt || !--vtep->refcnt) {
> +		LIST_REMOVE(vtep, next);
> +		flow_tcf_delete_iface(tcf, vtep);
> +	}
> +exit:
> +	pthread_mutex_unlock(&vtep_list_mutex);
> +}
> +
>  /**
>   * Apply flow to E-Switch by sending Netlink message.
>   *
> @@ -3461,18 +3887,61 @@ struct pedit_parser {
>  	       struct rte_flow_error *error)
>  {
>  	struct priv *priv = dev->data->dev_private;
> -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
>  	struct mlx5_flow *dev_flow;
>  	struct nlmsghdr *nlh;
> +	int ret;
>  
>  	dev_flow = LIST_FIRST(&flow->dev_flows);
>  	/* E-Switch flow can't be expanded. */
>  	assert(!LIST_NEXT(dev_flow, next));
> +	if (dev_flow->tcf.applied)
> +		return 0;
>  	nlh = dev_flow->tcf.nlh;
>  	nlh->nlmsg_type = RTM_NEWTFILTER;
>  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
> -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> +	if (dev_flow->tcf.tunnel) {
> +		/*
> +		 * Replace the interface index, target for
> +		 * encapsulation, source for decapsulation.
> +		 */
> +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> +		/* Create actual VTEP device when rule is being applied. */
> +		dev_flow->tcf.tunnel->ifindex_tun
> +			= flow_tcf_tunnel_vtep_create(tcf,
> +					*dev_flow->tcf.tunnel->ifindex_ptr,
> +					dev_flow, error);
> +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> +				dev_flow->tcf.tunnel->ifindex_tun,
> +				*dev_flow->tcf.tunnel->ifindex_ptr);
> +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> +			return -rte_errno;
> +		dev_flow->tcf.tunnel->ifindex_org
> +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> +		*dev_flow->tcf.tunnel->ifindex_ptr
> +			= dev_flow->tcf.tunnel->ifindex_tun;
> +	}
> +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +	if (dev_flow->tcf.tunnel) {
> +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> +				dev_flow->tcf.tunnel->ifindex_org,
> +				*dev_flow->tcf.tunnel->ifindex_ptr);
> +		*dev_flow->tcf.tunnel->ifindex_ptr
> +			= dev_flow->tcf.tunnel->ifindex_org;
> +		dev_flow->tcf.tunnel->ifindex_org = 0;

ifindex_org looks a temporary storage in this code. And this kind of hassle
(replace/restore) is there because you took the ifindex from the netlink
message. Why don't you have just

struct mlx5_flow_tcf_tunnel_hdr {
	uint32_t type; /**< Tunnel action type. */
	unsigned int ifindex; /**< Original dst/src interface */
	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message. */
};

and don't change ifindex?

Thanks,
Yongseok

> +	}
> +	if (!ret) {
> +		dev_flow->tcf.applied = 1;
>  		return 0;
> +	}
> +	DRV_LOG(WARNING, "netlink: failed to create TC rule (%d)", rte_errno);
> +	if (dev_flow->tcf.tunnel->ifindex_tun) {
> +		flow_tcf_tunnel_vtep_delete(tcf,
> +					    dev_flow->tcf.tunnel->ifindex_tun,
> +					    dev_flow);
> +		dev_flow->tcf.tunnel->ifindex_tun = 0;
> +	}
>  	return rte_flow_error_set(error, rte_errno,
>  				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
>  				  "netlink: failed to create TC flow rule");
> @@ -3490,7 +3959,7 @@ struct pedit_parser {
>  flow_tcf_remove(struct rte_eth_dev *dev, struct rte_flow *flow)
>  {
>  	struct priv *priv = dev->data->dev_private;
> -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
>  	struct mlx5_flow *dev_flow;
>  	struct nlmsghdr *nlh;
>  
> @@ -3501,10 +3970,36 @@ struct pedit_parser {
>  		return;
>  	/* E-Switch flow can't be expanded. */
>  	assert(!LIST_NEXT(dev_flow, next));
> +	if (!dev_flow->tcf.applied)
> +		return;
> +	if (dev_flow->tcf.tunnel) {
> +		/*
> +		 * Replace the interface index, target for
> +		 * encapsulation, source for decapsulation.
> +		 */
> +		assert(dev_flow->tcf.tunnel->ifindex_tun);
> +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> +		dev_flow->tcf.tunnel->ifindex_org
> +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> +		*dev_flow->tcf.tunnel->ifindex_ptr
> +			= dev_flow->tcf.tunnel->ifindex_tun;
> +	}
>  	nlh = dev_flow->tcf.nlh;
>  	nlh->nlmsg_type = RTM_DELTFILTER;
>  	nlh->nlmsg_flags = NLM_F_REQUEST;
> -	flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL);
> +	flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +	if (dev_flow->tcf.tunnel) {
> +		*dev_flow->tcf.tunnel->ifindex_ptr
> +			= dev_flow->tcf.tunnel->ifindex_org;
> +		dev_flow->tcf.tunnel->ifindex_org = 0;
> +		if (dev_flow->tcf.tunnel->ifindex_tun) {
> +			flow_tcf_tunnel_vtep_delete(tcf,
> +					dev_flow->tcf.tunnel->ifindex_tun,
> +					dev_flow);
> +			dev_flow->tcf.tunnel->ifindex_tun = 0;
> +		}
> +	}
> +	dev_flow->tcf.applied = 0;
>  }
>  
>  /**
>
  
Slava Ovsiienko Oct. 25, 2018, 8:21 p.m. UTC | #2
> -----Original Message-----
> From: Yongseok Koh
> Sent: Thursday, October 25, 2018 3:28
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> management
> 
> On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> > VXLAN interfaces are dynamically created for each local UDP port of
> > outer networks and then used as targets for TC "flower" filters in
> > order to perform encapsulation. These VXLAN interfaces are
> > system-wide, the only one device with given UDP port can exist in the
> > system (the attempt of creating another device with the same UDP local
> > port returns EEXIST), so PMD should support the shared device
> > instances database for PMD instances. These VXLAN implicitly created
> > devices are called VTEPs (Virtual Tunnel End Points).
> >
> > Creation of the VTEP occurs at the moment of rule applying. The link
> > is set up, root ingress qdisc is also initialized.
> >
> > Encapsulation VTEPs are created on per port basis, the single VTEP is
> > attached to the outer interface and is shared for all encapsulation
> > rules on this interface. The source UDP port is automatically selected
> > in range 30000-60000.
> >
> > For decapsulaton one VTEP is created per every unique UDP local port
> > to accept tunnel traffic. The name of created VTEP consists of prefix
> > "vmlx_" and the number of UDP port in decimal digits without leading
> > zeros (vmlx_4789). The VTEP can be preliminary created in the system
> > before the launching
> > application, it allows to share	UDP ports between primary
> > and secondary processes.
> >
> > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > ++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 499 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > index d6840d5..efa9c3b 100644
> > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> >  	return -err;
> >  }
> >
> > +/* VTEP device list is shared between PMD port instances. */ static
> > +LIST_HEAD(, mlx5_flow_tcf_vtep)
> > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER(); static
> pthread_mutex_t
> > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> 
> What's the reason for choosing pthread_mutex instead of rte_*_lock?

The sharing this database for secondary processes?

> 
> > +
> > +/**
> > + * Deletes VTEP network device.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] vtep
> > + *   Object represinting the network device to delete. Memory
> > + *   allocated for this object is freed by routine.
> > + */
> > +static void
> > +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf,
> > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > +	struct nlmsghdr *nlh;
> > +	struct ifinfomsg *ifm;
> > +	alignas(struct nlmsghdr)
> > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > +	int ret;
> > +
> > +	assert(!vtep->refcnt);
> > +	if (vtep->created && vtep->ifindex) {
> 
> First of all vtep->created seems of no use. It is introduced to select the error
> message in flow_tcf_create_iface(). I don't see any necessity to distinguish
> between 'vtep is allocated by rte_malloc()' and 'vtep is created in kernel'.

created flag indicates the iface is created by our code.
The VXLAN decap devices must have the specified UDP port, we can not create
multiple VXLAN devices with the same UDP port - EEXIST is returned. So, we have
to share device. One option is create device before DPDK application launch and use
these pre-created devices. Inthis case created flag is not set and VXLAN device
is not reinitialized, and not deleted.

> And why do you need to check vtep->ifindex as well? If vtep is created in
> kernel and its ifindex isn't set, that should be an error which had to be hanled
> in flow_tcf_create_iface(). Such a vtep shouldn't exist.
Yes, if we did not get ifindex of device - vtep is not created, error returned.
We just can not operate w/o ifindex.

> 
> Also, the refcnt management is a bit strange. Please put an abstraction by
> adding create_iface(), get_iface() and release_iface(). In the get_ifce(),
> vtep->refcnt should be incremented. And in the release_iface(), it
> vtep->decrease the
OK. Good proposal. I'll refactor the code.

> refcnt and if it reaches to zero, the iface can be removed. create_iface() will
> set the refcnt to 1. And if you refer to mlx5_hrxq_get(), it even does
> searching the list not by repeating the same lookup code here and there.
> That will make your code much simpler.
> 
> > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > +		nlh = mnl_nlmsg_put_header(buf);
> > +		nlh->nlmsg_type = RTM_DELLINK;
> > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > +		ifm->ifi_family = AF_UNSPEC;
> > +		ifm->ifi_index = vtep->ifindex;
> > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +		if (ret)
> > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN
> "
> > +					 "encap/decap ifindex %u",
> > +					 ifm->ifi_index);
> > +	}
> > +	rte_free(vtep);
> > +}
> > +
> > +/**
> > + * Creates VTEP network device.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifouter
> > + *   Outer interface to attach new-created VXLAN device
> > + *   If zero the VXLAN device will not be attached to any device.
> > + * @param[in] port
> > + *   UDP port of created VTEP device.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + *
> > + * @return
> > + * Pointer to created device structure on success, NULL otherwise
> > + * and rte_errno is set.
> > + */
> > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> 
> Why negative(ifndef) first intead of positive(ifdef)?
Hm. Did I miss the rule. Positive #ifdef first? OK.

> 
> > +static struct mlx5_flow_tcf_vtep*
> > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
> > +		      unsigned int ifouter __rte_unused,
> > +		      uint16_t port __rte_unused,
> > +		      struct rte_flow_error *error) {
> > +	rte_flow_error_set(error, ENOTSUP,
> > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > +			 "netlink: failed to create VTEP, "
> > +			 "VXLAN metadat is not supported by kernel");
> 
> Typo.

OK.  "metadata are not supported".
> 
> > +	return NULL;
> > +}
> > +#else
> > +static struct mlx5_flow_tcf_vtep*
> > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> 
> How about adding 'vtep'? It sounds vague - creating a general interface.
> E.g., flow_tcf_create_vtep_iface()?

OK.

> 
> > +		      unsigned int ifouter,
> > +		      uint16_t port, struct rte_flow_error *error) {
> > +	struct mlx5_flow_tcf_vtep *vtep;
> > +	struct nlmsghdr *nlh;
> > +	struct ifinfomsg *ifm;
> > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > +	alignas(struct nlmsghdr)
> > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> 
> Use a macro for '128'. Can't know the meaning.
OK. I think we should calculate the buffer size explicitly.

> 
> > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > +		       SZ_NLATTR_NEST * 2 +
> > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > +	struct nlattr *na_info;
> > +	struct nlattr *na_vxlan;
> > +	rte_be16_t vxlan_port = RTE_BE16(port);
> 
> Use rte_cpu_to_be_*() instead.

Yes, I'll recheck the whole code for this issue.

> 
> > +	int ret;
> > +
> > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > +			alignof(struct mlx5_flow_tcf_vtep));
> > +	if (!vtep) {
> > +		rte_flow_error_set
> > +			(error, ENOMEM,
> RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > +			 NULL, "unadble to allocate memory for VTEP desc");
> > +		return NULL;
> > +	}
> > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > +			.refcnt = 0,
> > +			.port = port,
> > +			.created = 0,
> > +			.ifouter = 0,
> > +			.ifindex = 0,
> > +			.local = LIST_HEAD_INITIALIZER(),
> > +			.neigh = LIST_HEAD_INITIALIZER(),
> > +	};
> > +	memset(buf, 0, sizeof(buf));
> > +	nlh = mnl_nlmsg_put_header(buf);
> > +	nlh->nlmsg_type = RTM_NEWLINK;
> > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> NLM_F_EXCL;
> > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > +	ifm->ifi_family = AF_UNSPEC;
> > +	ifm->ifi_type = 0;
> > +	ifm->ifi_index = 0;
> > +	ifm->ifi_flags = IFF_UP;
> > +	ifm->ifi_change = 0xffffffff;
> > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX,
> port);
> > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > +	assert(na_info);
> > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > +	if (ifouter)
> > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > +	assert(na_vxlan);
> > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > +	mnl_attr_nest_end(nlh, na_vxlan);
> > +	mnl_attr_nest_end(nlh, na_info);
> > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +	if (ret)
> > +		DRV_LOG(WARNING,
> > +			"netlink: VTEP %s create failure (%d)",
> > +			name, rte_errno);
> > +	else
> > +		vtep->created = 1;
> 
> Flow of code here isn't smooth, thus could be error-prone. Most of all, I
> don't like ret has multiple meanings. ret should be return value but you are
> using it to store ifindex.
> 
> > +	if (ret && ifouter)
> > +		ret = 0;
> > +	else
> > +		ret = if_nametoindex(name);
> 
> If vtep isn't created and ifouter is set, then skip init below, which means, if

ifouter is set for VXLAN encap devices. They should be attached to ifouter
and can not be shared. So, if ifouter I set - we do not use the precreated/existing
VXLAN devices. We have to create our own not shared device.

> vtep is created or ifouter is set, it tries to get ifindex of vtep.
> But why do you want to try to call this API even if it failed to create vtep?
> Let's not make code flow convoluted even though it logically works. Let's
> make it straightforward.
> 
> > +	if (ret) {
> > +		vtep->ifindex = ret;
> > +		vtep->ifouter = ifouter;
> > +		memset(buf, 0, sizeof(buf));
> > +		nlh = mnl_nlmsg_put_header(buf);
> > +		nlh->nlmsg_type = RTM_NEWLINK;
> > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > +		ifm->ifi_family = AF_UNSPEC;
> > +		ifm->ifi_type = 0;
> > +		ifm->ifi_index = vtep->ifindex;
> > +		ifm->ifi_flags = IFF_UP;
> > +		ifm->ifi_change = IFF_UP;
> > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +		if (ret) {
> > +			DRV_LOG(WARNING,
> > +				"netlink: VTEP %s set link up failure (%d)",
> > +				name, rte_errno);
> > +			rte_free(vtep);
> > +			rte_flow_error_set
> > +				(error, -errno,
> > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> NULL,
> > +				 "netlink: failed to set VTEP link up");
> > +			vtep = NULL;
> > +		} else {
> > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > +			if (ret)
> > +				DRV_LOG(WARNING,
> > +				"VTEP %s init failure (%d)", name, rte_errno);
> > +		}
> > +	} else {
> > +		DRV_LOG(WARNING,
> > +			"VTEP %s failed to get index (%d)", name, errno);
> > +		rte_flow_error_set
> > +			(error, -errno,
> > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > +			 "netlink: failed to retrieve VTEP ifindex");
> > +			 ret = 1;
> 
> If it fails to create a vtep above, it will print out two warning messages and
> one rte_flow_error message. And it even selects message to print between
> two?
> And there's another info msg at the end even in case of failure. Do you really
> want to do this even with manipulating ret to change code path?  Not a good
> practice.
> 
> Usually, code path should be straightforward for sucessful path and for
> errors/failures, return immediately or use 'goto' if there's need for cleanup.
> 
> Please refactor entire function.

I think I'll split it in two ones - for attached and potentially shared ifaces.
> 
> > +	}
> > +	if (ret) {
> > +		flow_tcf_delete_iface(tcf, vtep);
> > +		vtep = NULL;
> > +	}
> > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" :
> "error");
> > +	return vtep;
> > +}
> > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > +
> > +/**
> > + * Create target interface index for VXLAN tunneling decapsulation.
> > + * In order to share the UDP port within the other interfaces the
> > + * VXLAN device created as not attached to any interface (if created).
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] dev_flow
> > + *   Flow tcf object with tunnel structure pointer set.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + * @return
> > + *   Interface index on success, zero otherwise and rte_errno is set.
> 
> Return negative errno in case of failure like others.

Anyway, we have to return an index. If we do not return it as function result
we will need to provide some extra pointing parameter, it complicates the code.

> 
>  *   Interface index on success, a negative errno value otherwise and
> rte_errno is set.
> 
> > + */
> > +static unsigned int
> > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > +			   struct mlx5_flow *dev_flow,
> > +			   struct rte_flow_error *error)
> > +{
> > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > +
> > +	vtep = NULL;
> > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > +		if (vlst->port == port) {
> > +			vtep = vlst;
> > +			break;
> > +		}
> > +	}
> 
> You just need one variable.

Yes. There is a long story, I forgot to revert code to one variable after debugging.
> 
> 	struct mlx5_flow_tcf_vtep *vtep;
> 
> 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> 		if (vtep->port == port)
> 			break;
> 	}
> 
> > +	if (!vtep) {
> > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > +		if (vtep)
> > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > +	} else {
> > +		if (vtep->ifouter) {
> > +			rte_flow_error_set(error, -errno,
> > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> NULL,
> > +				"Failed to create decap VTEP, attached "
> > +				"device with the same UDP port exists");
> > +				vtep = NULL;
> 
> Making vtep null to skip the following code?

Yes. To avoid multiple return operators in code.
> Please merge the two same
> if/else and make the code path strightforward. And which errno do you
> expect here?
> Should it be set EEXIST instead?
Not always. Netlink returns the code. 

> 
> > +		}
> > +	}
> > +	if (vtep) {
> > +		vtep->refcnt++;
> > +		assert(vtep->ifindex);
> > +		return vtep->ifindex;
> > +	} else {
> > +		return 0;
> > +	}
> 
> Why repeating same if/else?
> 
> 
> This is my suggestion but if you take my suggestion to have
> flow_tcf_[create|get|release]_iface(), this will get much simpler.
Agree.

> 
> {
> 	struct mlx5_flow_tcf_vtep *vtep;
> 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> 
> 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> 		if (vtep->port == port)
> 			break;
> 	}
> 	if (vtep && vtep->ifouter)
> 		return rte_flow_error_set(... EEXIST ...);
> 	else if (vtep) {
> 		++vtep->refcnt;
> 	} else {
> 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> 		if (!vtep)
> 			return rte_flow_error_set(...);
> 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> 	}
> 	assert(vtep->ifindex);
> 	return vtep->ifindex;
> }
> 
> 
> > +}
> > +
> > +/**
> > + * Creates target interface index for VXLAN tunneling encapsulation.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifouter
> > + *   Network interface index to attach VXLAN encap device to.
> > + * @param[in] dev_flow
> > + *   Flow tcf object with tunnel structure pointer set.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + * @return
> > + *   Interface index on success, zero otherwise and rte_errno is set.
> > + */
> > +static unsigned int
> > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > +			    unsigned int ifouter,
> > +			    struct mlx5_flow *dev_flow __rte_unused,
> > +			    struct rte_flow_error *error)
> > +{
> > +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > +
> > +	assert(ifouter);
> > +	/* Look whether the attached VTEP for encap is created. */
> > +	vtep = NULL;
> > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > +		if (vlst->ifouter == ifouter) {
> > +			vtep = vlst;
> > +			break;
> > +		}
> > +	}
> 
> Same here.
> 
> > +	if (!vtep) {
> > +		uint16_t pcnt;
> > +
> > +		/* Not found, we should create the new attached VTEP. */
> > +/*
> > + * TODO: not implemented yet
> > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> 
> Personal note is not appropriate even though it is removed in the following
> patch.
> 
> > +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> > +				     - MLX5_VXLAN_PORT_RANGE_MIN);
> pcnt++) {
> > +			encap_port++;
> > +			/* Wraparound the UDP port index. */
> > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN
> ||
> > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > +				encap_port =
> MLX5_VXLAN_PORT_RANGE_MIN;
> > +			/* Check whether UDP port is in already in use. */
> > +			vtep = NULL;
> > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > +				if (vlst->port == encap_port) {
> > +					vtep = vlst;
> > +					break;
> > +				}
> > +			}
> 
> If you want to find out an empty port number, you can use rte_bitmap
> instead of repeating searching the entire list for all possible port numbers.

We do not expect too many VXLAN devices have been created. bitmap.

> 
> > +			if (vtep) {
> > +				vtep = NULL;
> > +				continue;
> > +			}
> > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > +						     encap_port, error);
> > +			if (vtep) {
> > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> next);
> > +				break;
> > +			}
> > +			if (rte_errno != EEXIST)
> > +				break;
> > +		}
> > +	}
> > +	if (!vtep)
> > +		return 0;
> > +	vtep->refcnt++;
> > +	assert(vtep->ifindex);
> > +	return vtep->ifindex;
> 
> Please refactor this func according to what I suggested for
> flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> 
> > +}
> > +
> > +/**
> > + * Creates target interface index for tunneling of any type.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifouter
> > + *   Network interface index to attach VXLAN encap device to.
> > + * @param[in] dev_flow
> > + *   Flow tcf object with tunnel structure pointer set.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + * @return
> > + *   Interface index on success, zero otherwise and rte_errno is set.
> 
>  *   Interface index on success, a negative errno value otherwise and
>  *   rte_errno is set.
> 
> > + */
> > +static unsigned int
> > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > +			    unsigned int ifouter,
> > +			    struct mlx5_flow *dev_flow,
> > +			    struct rte_flow_error *error)
> > +{
> > +	unsigned int ret;
> > +
> > +	assert(dev_flow->tcf.tunnel);
> > +	pthread_mutex_lock(&vtep_list_mutex);
> > +	switch (dev_flow->tcf.tunnel->type) {
> > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > +						 dev_flow, error);
> > +		break;
> > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > +		break;
> > +	default:
> > +		rte_flow_error_set(error, ENOTSUP,
> > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> NULL,
> > +				"unsupported tunnel type");
> > +		ret = 0;
> > +		break;
> > +	}
> > +	pthread_mutex_unlock(&vtep_list_mutex);
> > +	return ret;
> > +}
> > +
> > +/**
> > + * Deletes tunneling interface by UDP port.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifindex
> > + *   Network interface index of VXLAN device.
> > + * @param[in] dev_flow
> > + *   Flow tcf object with tunnel structure pointer set.
> > + */
> > +static void
> > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > +			    unsigned int ifindex,
> > +			    struct mlx5_flow *dev_flow)
> > +{
> > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > +
> > +	assert(dev_flow->tcf.tunnel);
> > +	pthread_mutex_lock(&vtep_list_mutex);
> > +	vtep = NULL;
> > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > +		if (vlst->ifindex == ifindex) {
> > +			vtep = vlst;
> > +			break;
> > +		}
> > +	}
> 
> It is weird. You just can have vtep pointer in the dev_flow->tcf.tunnel instead
> of ifindex_tun which is same as vtep->ifindex like the assertion below. Then,
> this lookup can be skipped.

OK. Good optimization.

> 
> > +	if (!vtep) {
> > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > +		goto exit;
> > +	}
> > +	switch (dev_flow->tcf.tunnel->type) {
> > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > +		break;
> > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > +/*
> > + * TODO: Remove the encap ancillary rules first.
> > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);  */
> 
> Is it a personal note? Please remove.
OK.

> 
> > +		break;
> > +	default:
> > +		assert(false);
> > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > +		break;
> > +	}
> > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > +	assert(vtep->refcnt);
> > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > +		LIST_REMOVE(vtep, next);
> > +		flow_tcf_delete_iface(tcf, vtep);
> > +	}
> > +exit:
> > +	pthread_mutex_unlock(&vtep_list_mutex);
> > +}
> > +
> >  /**
> >   * Apply flow to E-Switch by sending Netlink message.
> >   *
> > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> >  	       struct rte_flow_error *error)  {
> >  	struct priv *priv = dev->data->dev_private;
> > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> >  	struct mlx5_flow *dev_flow;
> >  	struct nlmsghdr *nlh;
> > +	int ret;
> >
> >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> >  	/* E-Switch flow can't be expanded. */
> >  	assert(!LIST_NEXT(dev_flow, next));
> > +	if (dev_flow->tcf.applied)
> > +		return 0;
> >  	nlh = dev_flow->tcf.nlh;
> >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> NLM_F_EXCL;
> > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > +	if (dev_flow->tcf.tunnel) {
> > +		/*
> > +		 * Replace the interface index, target for
> > +		 * encapsulation, source for decapsulation.
> > +		 */
> > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > +		/* Create actual VTEP device when rule is being applied. */
> > +		dev_flow->tcf.tunnel->ifindex_tun
> > +			= flow_tcf_tunnel_vtep_create(tcf,
> > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > +					dev_flow, error);
> > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > +				dev_flow->tcf.tunnel->ifindex_tun,
> > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > +			return -rte_errno;
> > +		dev_flow->tcf.tunnel->ifindex_org
> > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > +	}
> > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +	if (dev_flow->tcf.tunnel) {
> > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > +				dev_flow->tcf.tunnel->ifindex_org,
> > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > +			= dev_flow->tcf.tunnel->ifindex_org;
> > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> 
> ifindex_org looks a temporary storage in this code. And this kind of hassle
> (replace/restore) is there because you took the ifindex from the netlink
> message. Why don't you have just
> 
> struct mlx5_flow_tcf_tunnel_hdr {
> 	uint32_t type; /**< Tunnel action type. */
> 	unsigned int ifindex; /**< Original dst/src interface */
> 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> */ };
> 
> and don't change ifindex?

I propose to use the local variable for ifindex_org and do not keep it
in structure. *ifindex_ptr will keep.

With best regards,
Slava

> 
> Thanks,
> Yongseok
> 
> > +	}
> > +	if (!ret) {
> > +		dev_flow->tcf.applied = 1;
> >  		return 0;
> > +	}
> > +	DRV_LOG(WARNING, "netlink: failed to create TC rule (%d)",
> rte_errno);
> > +	if (dev_flow->tcf.tunnel->ifindex_tun) {
> > +		flow_tcf_tunnel_vtep_delete(tcf,
> > +					    dev_flow->tcf.tunnel->ifindex_tun,
> > +					    dev_flow);
> > +		dev_flow->tcf.tunnel->ifindex_tun = 0;
> > +	}
> >  	return rte_flow_error_set(error, rte_errno,
> >  				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> NULL,
> >  				  "netlink: failed to create TC flow rule"); @@
> -3490,7 +3959,7
> > @@ struct pedit_parser {  flow_tcf_remove(struct rte_eth_dev *dev,
> > struct rte_flow *flow)  {
> >  	struct priv *priv = dev->data->dev_private;
> > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> >  	struct mlx5_flow *dev_flow;
> >  	struct nlmsghdr *nlh;
> >
> > @@ -3501,10 +3970,36 @@ struct pedit_parser {
> >  		return;
> >  	/* E-Switch flow can't be expanded. */
> >  	assert(!LIST_NEXT(dev_flow, next));
> > +	if (!dev_flow->tcf.applied)
> > +		return;
> > +	if (dev_flow->tcf.tunnel) {
> > +		/*
> > +		 * Replace the interface index, target for
> > +		 * encapsulation, source for decapsulation.
> > +		 */
> > +		assert(dev_flow->tcf.tunnel->ifindex_tun);
> > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > +		dev_flow->tcf.tunnel->ifindex_org
> > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > +	}
> >  	nlh = dev_flow->tcf.nlh;
> >  	nlh->nlmsg_type = RTM_DELTFILTER;
> >  	nlh->nlmsg_flags = NLM_F_REQUEST;
> > -	flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL);
> > +	flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +	if (dev_flow->tcf.tunnel) {
> > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > +			= dev_flow->tcf.tunnel->ifindex_org;
> > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > +		if (dev_flow->tcf.tunnel->ifindex_tun) {
> > +			flow_tcf_tunnel_vtep_delete(tcf,
> > +					dev_flow->tcf.tunnel->ifindex_tun,
> > +					dev_flow);
> > +			dev_flow->tcf.tunnel->ifindex_tun = 0;
> > +		}
> > +	}
> > +	dev_flow->tcf.applied = 0;
> >  }
> >
> >  /**
> >
  
Yongseok Koh Oct. 26, 2018, 6:25 a.m. UTC | #3
On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Thursday, October 25, 2018 3:28
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > management
> > 
> > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> > > VXLAN interfaces are dynamically created for each local UDP port of
> > > outer networks and then used as targets for TC "flower" filters in
> > > order to perform encapsulation. These VXLAN interfaces are
> > > system-wide, the only one device with given UDP port can exist in the
> > > system (the attempt of creating another device with the same UDP local
> > > port returns EEXIST), so PMD should support the shared device
> > > instances database for PMD instances. These VXLAN implicitly created
> > > devices are called VTEPs (Virtual Tunnel End Points).
> > >
> > > Creation of the VTEP occurs at the moment of rule applying. The link
> > > is set up, root ingress qdisc is also initialized.
> > >
> > > Encapsulation VTEPs are created on per port basis, the single VTEP is
> > > attached to the outer interface and is shared for all encapsulation
> > > rules on this interface. The source UDP port is automatically selected
> > > in range 30000-60000.
> > >
> > > For decapsulaton one VTEP is created per every unique UDP local port
> > > to accept tunnel traffic. The name of created VTEP consists of prefix
> > > "vmlx_" and the number of UDP port in decimal digits without leading
> > > zeros (vmlx_4789). The VTEP can be preliminary created in the system
> > > before the launching
> > > application, it allows to share	UDP ports between primary
> > > and secondary processes.
> > >
> > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > ---
> > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > ++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > index d6840d5..efa9c3b 100644
> > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > >  	return -err;
> > >  }
> > >
> > > +/* VTEP device list is shared between PMD port instances. */ static
> > > +LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER(); static
> > pthread_mutex_t
> > > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > 
> > What's the reason for choosing pthread_mutex instead of rte_*_lock?
> 
> The sharing this database for secondary processes?

The static variable isn't shared with sec proc. But you can leave it as is.

> > > +
> > > +/**
> > > + * Deletes VTEP network device.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] vtep
> > > + *   Object represinting the network device to delete. Memory
> > > + *   allocated for this object is freed by routine.
> > > + */
> > > +static void
> > > +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf,
> > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > +	struct nlmsghdr *nlh;
> > > +	struct ifinfomsg *ifm;
> > > +	alignas(struct nlmsghdr)
> > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > +	int ret;
> > > +
> > > +	assert(!vtep->refcnt);
> > > +	if (vtep->created && vtep->ifindex) {
> > 
> > First of all vtep->created seems of no use. It is introduced to select the error
> > message in flow_tcf_create_iface(). I don't see any necessity to distinguish
> > between 'vtep is allocated by rte_malloc()' and 'vtep is created in kernel'.
> 
> created flag indicates the iface is created by our code.
> The VXLAN decap devices must have the specified UDP port, we can not create
> multiple VXLAN devices with the same UDP port - EEXIST is returned. So, we have
> to share device. One option is create device before DPDK application launch and use
> these pre-created devices. Inthis case created flag is not set and VXLAN device
> is not reinitialized, and not deleted.

I can't see any code to use pre-created device (created even before dpdk app
launch). Your code just tries to create 'vmlx_xxxx'. Even from your comment in
[7/7] patch, PMD will cleanup any leftovers (existing vtep devices) on
initialization. Your comment sounds conflicting and confusing.

> > And why do you need to check vtep->ifindex as well? If vtep is created in
> > kernel and its ifindex isn't set, that should be an error which had to be hanled
> > in flow_tcf_create_iface(). Such a vtep shouldn't exist.
> Yes, if we did not get ifindex of device - vtep is not created, error returned.
> We just can not operate w/o ifindex.

I know ifindex is needed but my question was checking vtep->ifindex here looked
redundant/unnecessary. But as you agreed on having create/get/release_iface(),
it doesn't matter much.

> > Also, the refcnt management is a bit strange. Please put an abstraction by
> > adding create_iface(), get_iface() and release_iface(). In the get_ifce(),
> > vtep->refcnt should be incremented. And in the release_iface(), it
> > vtep->decrease the
> OK. Good proposal. I'll refactor the code.
> 
> > refcnt and if it reaches to zero, the iface can be removed. create_iface() will
> > set the refcnt to 1. And if you refer to mlx5_hrxq_get(), it even does
> > searching the list not by repeating the same lookup code here and there.
> > That will make your code much simpler.
> > 
> > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > +		nlh = mnl_nlmsg_put_header(buf);
> > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > +		ifm->ifi_family = AF_UNSPEC;
> > > +		ifm->ifi_index = vtep->ifindex;
> > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +		if (ret)
> > > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN
> > "
> > > +					 "encap/decap ifindex %u",
> > > +					 ifm->ifi_index);
> > > +	}
> > > +	rte_free(vtep);
> > > +}
> > > +
> > > +/**
> > > + * Creates VTEP network device.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifouter
> > > + *   Outer interface to attach new-created VXLAN device
> > > + *   If zero the VXLAN device will not be attached to any device.
> > > + * @param[in] port
> > > + *   UDP port of created VTEP device.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + *
> > > + * @return
> > > + * Pointer to created device structure on success, NULL otherwise
> > > + * and rte_errno is set.
> > > + */
> > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > 
> > Why negative(ifndef) first intead of positive(ifdef)?
> Hm. Did I miss the rule. Positive #ifdef first? OK.

No concrete rule but if there's no specific reason, it would be better to start
from ifdef.

> > > +static struct mlx5_flow_tcf_vtep*
> > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
> > > +		      unsigned int ifouter __rte_unused,
> > > +		      uint16_t port __rte_unused,
> > > +		      struct rte_flow_error *error) {
> > > +	rte_flow_error_set(error, ENOTSUP,
> > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > +			 "netlink: failed to create VTEP, "
> > > +			 "VXLAN metadat is not supported by kernel");
> > 
> > Typo.
> 
> OK.  "metadata are not supported".
> > 
> > > +	return NULL;
> > > +}
> > > +#else
> > > +static struct mlx5_flow_tcf_vtep*
> > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> > 
> > How about adding 'vtep'? It sounds vague - creating a general interface.
> > E.g., flow_tcf_create_vtep_iface()?
> 
> OK.
> 
> > 
> > > +		      unsigned int ifouter,
> > > +		      uint16_t port, struct rte_flow_error *error) {
> > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > +	struct nlmsghdr *nlh;
> > > +	struct ifinfomsg *ifm;
> > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > +	alignas(struct nlmsghdr)
> > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > 
> > Use a macro for '128'. Can't know the meaning.
> OK. I think we should calculate the buffer size explicitly.
> 
> > 
> > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > +		       SZ_NLATTR_NEST * 2 +
> > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > +	struct nlattr *na_info;
> > > +	struct nlattr *na_vxlan;
> > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > 
> > Use rte_cpu_to_be_*() instead.
> 
> Yes, I'll recheck the whole code for this issue.
> 
> > 
> > > +	int ret;
> > > +
> > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > +	if (!vtep) {
> > > +		rte_flow_error_set
> > > +			(error, ENOMEM,
> > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > +			 NULL, "unadble to allocate memory for VTEP desc");
> > > +		return NULL;
> > > +	}
> > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > +			.refcnt = 0,
> > > +			.port = port,
> > > +			.created = 0,
> > > +			.ifouter = 0,
> > > +			.ifindex = 0,
> > > +			.local = LIST_HEAD_INITIALIZER(),
> > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > +	};
> > > +	memset(buf, 0, sizeof(buf));
> > > +	nlh = mnl_nlmsg_put_header(buf);
> > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> > NLM_F_EXCL;
> > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > +	ifm->ifi_family = AF_UNSPEC;
> > > +	ifm->ifi_type = 0;
> > > +	ifm->ifi_index = 0;
> > > +	ifm->ifi_flags = IFF_UP;
> > > +	ifm->ifi_change = 0xffffffff;
> > > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX,
> > port);
> > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > +	assert(na_info);
> > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > +	if (ifouter)
> > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > +	assert(na_vxlan);
> > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > +	mnl_attr_nest_end(nlh, na_info);
> > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +	if (ret)
> > > +		DRV_LOG(WARNING,
> > > +			"netlink: VTEP %s create failure (%d)",
> > > +			name, rte_errno);
> > > +	else
> > > +		vtep->created = 1;
> > 
> > Flow of code here isn't smooth, thus could be error-prone. Most of all, I
> > don't like ret has multiple meanings. ret should be return value but you are
> > using it to store ifindex.
> > 
> > > +	if (ret && ifouter)
> > > +		ret = 0;
> > > +	else
> > > +		ret = if_nametoindex(name);
> > 
> > If vtep isn't created and ifouter is set, then skip init below, which means, if
> 
> ifouter is set for VXLAN encap devices. They should be attached to ifouter
> and can not be shared. So, if ifouter I set - we do not use the precreated/existing
> VXLAN devices. We have to create our own not shared device.

In your code (flow_tcf_encap_vtep_create()), it is shared by multiple flows. Do
you mean it isn't shared between different outer ifaces? If so, that's for sure.

> > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > But why do you want to try to call this API even if it failed to create vtep?
> > Let's not make code flow convoluted even though it logically works. Let's
> > make it straightforward.
> > 
> > > +	if (ret) {
> > > +		vtep->ifindex = ret;
> > > +		vtep->ifouter = ifouter;
> > > +		memset(buf, 0, sizeof(buf));
> > > +		nlh = mnl_nlmsg_put_header(buf);
> > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > +		ifm->ifi_family = AF_UNSPEC;
> > > +		ifm->ifi_type = 0;
> > > +		ifm->ifi_index = vtep->ifindex;
> > > +		ifm->ifi_flags = IFF_UP;
> > > +		ifm->ifi_change = IFF_UP;
> > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +		if (ret) {
> > > +			DRV_LOG(WARNING,
> > > +				"netlink: VTEP %s set link up failure (%d)",
> > > +				name, rte_errno);
> > > +			rte_free(vtep);
> > > +			rte_flow_error_set
> > > +				(error, -errno,
> > > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > +				 "netlink: failed to set VTEP link up");
> > > +			vtep = NULL;
> > > +		} else {
> > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > > +			if (ret)
> > > +				DRV_LOG(WARNING,
> > > +				"VTEP %s init failure (%d)", name, rte_errno);
> > > +		}
> > > +	} else {
> > > +		DRV_LOG(WARNING,
> > > +			"VTEP %s failed to get index (%d)", name, errno);
> > > +		rte_flow_error_set
> > > +			(error, -errno,
> > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > +			 ret = 1;
> > 
> > If it fails to create a vtep above, it will print out two warning messages and
> > one rte_flow_error message. And it even selects message to print between
> > two?
> > And there's another info msg at the end even in case of failure. Do you really
> > want to do this even with manipulating ret to change code path?  Not a good
> > practice.
> > 
> > Usually, code path should be straightforward for sucessful path and for
> > errors/failures, return immediately or use 'goto' if there's need for cleanup.
> > 
> > Please refactor entire function.
> 
> I think I'll split it in two ones - for attached and potentially shared ifaces.
> > 
> > > +	}
> > > +	if (ret) {
> > > +		flow_tcf_delete_iface(tcf, vtep);
> > > +		vtep = NULL;
> > > +	}
> > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" :
> > "error");
> > > +	return vtep;
> > > +}
> > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > +
> > > +/**
> > > + * Create target interface index for VXLAN tunneling decapsulation.
> > > + * In order to share the UDP port within the other interfaces the
> > > + * VXLAN device created as not attached to any interface (if created).
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + * @return
> > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > 
> > Return negative errno in case of failure like others.
> 
> Anyway, we have to return an index. If we do not return it as function result
> we will need to provide some extra pointing parameter, it complicates the code.

You misunderstood it. See what I wrote below. The function still returns the
index but in case of error, make it return negative errno instead of zero.

> > 
> >  *   Interface index on success, a negative errno value otherwise and
> > rte_errno is set.
> > 
> > > + */
> > > +static unsigned int
> > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > +			   struct mlx5_flow *dev_flow,
> > > +			   struct rte_flow_error *error)
> > > +{
> > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > +
> > > +	vtep = NULL;
> > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +		if (vlst->port == port) {
> > > +			vtep = vlst;
> > > +			break;
> > > +		}
> > > +	}
> > 
> > You just need one variable.
> 
> Yes. There is a long story, I forgot to revert code to one variable after debugging.
> > 
> > 	struct mlx5_flow_tcf_vtep *vtep;
> > 
> > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > 		if (vtep->port == port)
> > 			break;
> > 	}
> > 
> > > +	if (!vtep) {
> > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > +		if (vtep)
> > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > +	} else {
> > > +		if (vtep->ifouter) {
> > > +			rte_flow_error_set(error, -errno,
> > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > +				"Failed to create decap VTEP, attached "
> > > +				"device with the same UDP port exists");
> > > +				vtep = NULL;
> > 
> > Making vtep null to skip the following code?
> 
> Yes. To avoid multiple return operators in code.

It's okay to have multiple returns. Why not?

> > Please merge the two same
> > if/else and make the code path strightforward. And which errno do you
> > expect here?
> > Should it be set EEXIST instead?
> Not always. Netlink returns the code. 

No, that's not my point. Your code above sets errno instead of rte_errno or
EEXIST.

	} else {
		if (vtep->ifouter) {
			rte_flow_error_set(error, -errno,

Which one sets this errno? Here, it sets rte_errno because matched vtep can't be
used as it already has outer iface attached (error message isn't clear, please
reword it too). I thought this should be EEXIST but you set errno to rte_errno
but errno isn't valid at this point.

> 
> > 
> > > +		}
> > > +	}
> > > +	if (vtep) {
> > > +		vtep->refcnt++;
> > > +		assert(vtep->ifindex);
> > > +		return vtep->ifindex;
> > > +	} else {
> > > +		return 0;
> > > +	}
> > 
> > Why repeating same if/else?
> > 
> > 
> > This is my suggestion but if you take my suggestion to have
> > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> Agree.
> 
> > 
> > {
> > 	struct mlx5_flow_tcf_vtep *vtep;
> > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > 
> > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > 		if (vtep->port == port)
> > 			break;
> > 	}
> > 	if (vtep && vtep->ifouter)
> > 		return rte_flow_error_set(... EEXIST ...);
> > 	else if (vtep) {
> > 		++vtep->refcnt;
> > 	} else {
> > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > 		if (!vtep)
> > 			return rte_flow_error_set(...);
> > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > 	}
> > 	assert(vtep->ifindex);
> > 	return vtep->ifindex;
> > }
> > 
> > 
> > > +}
> > > +
> > > +/**
> > > + * Creates target interface index for VXLAN tunneling encapsulation.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifouter
> > > + *   Network interface index to attach VXLAN encap device to.
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + * @return
> > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > + */
> > > +static unsigned int
> > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > +			    unsigned int ifouter,
> > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > +			    struct rte_flow_error *error)
> > > +{
> > > +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > +
> > > +	assert(ifouter);
> > > +	/* Look whether the attached VTEP for encap is created. */
> > > +	vtep = NULL;
> > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +		if (vlst->ifouter == ifouter) {
> > > +			vtep = vlst;
> > > +			break;
> > > +		}
> > > +	}
> > 
> > Same here.
> > 
> > > +	if (!vtep) {
> > > +		uint16_t pcnt;
> > > +
> > > +		/* Not found, we should create the new attached VTEP. */
> > > +/*
> > > + * TODO: not implemented yet
> > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > 
> > Personal note is not appropriate even though it is removed in the following
> > patch.
> > 
> > > +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> > > +				     - MLX5_VXLAN_PORT_RANGE_MIN);
> > pcnt++) {
> > > +			encap_port++;
> > > +			/* Wraparound the UDP port index. */
> > > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN
> > ||
> > > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > > +				encap_port =
> > MLX5_VXLAN_PORT_RANGE_MIN;
> > > +			/* Check whether UDP port is in already in use. */
> > > +			vtep = NULL;
> > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +				if (vlst->port == encap_port) {
> > > +					vtep = vlst;
> > > +					break;
> > > +				}
> > > +			}
> > 
> > If you want to find out an empty port number, you can use rte_bitmap
> > instead of repeating searching the entire list for all possible port numbers.
> 
> We do not expect too many VXLAN devices have been created. bitmap.

+1, valid point.

> > > +			if (vtep) {
> > > +				vtep = NULL;
> > > +				continue;
> > > +			}
> > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > +						     encap_port, error);
> > > +			if (vtep) {
> > > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> > next);
> > > +				break;
> > > +			}
> > > +			if (rte_errno != EEXIST)
> > > +				break;
> > > +		}
> > > +	}
> > > +	if (!vtep)
> > > +		return 0;
> > > +	vtep->refcnt++;
> > > +	assert(vtep->ifindex);
> > > +	return vtep->ifindex;
> > 
> > Please refactor this func according to what I suggested for
> > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > 
> > > +}
> > > +
> > > +/**
> > > + * Creates target interface index for tunneling of any type.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifouter
> > > + *   Network interface index to attach VXLAN encap device to.
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + * @return
> > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > 
> >  *   Interface index on success, a negative errno value otherwise and
> >  *   rte_errno is set.
> > 
> > > + */
> > > +static unsigned int
> > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > +			    unsigned int ifouter,
> > > +			    struct mlx5_flow *dev_flow,
> > > +			    struct rte_flow_error *error)
> > > +{
> > > +	unsigned int ret;
> > > +
> > > +	assert(dev_flow->tcf.tunnel);
> > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > +	switch (dev_flow->tcf.tunnel->type) {
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > +						 dev_flow, error);
> > > +		break;
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > > +		break;
> > > +	default:
> > > +		rte_flow_error_set(error, ENOTSUP,
> > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > +				"unsupported tunnel type");
> > > +		ret = 0;
> > > +		break;
> > > +	}
> > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * Deletes tunneling interface by UDP port.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifindex
> > > + *   Network interface index of VXLAN device.
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + */
> > > +static void
> > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > +			    unsigned int ifindex,
> > > +			    struct mlx5_flow *dev_flow)
> > > +{
> > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > +
> > > +	assert(dev_flow->tcf.tunnel);
> > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > +	vtep = NULL;
> > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +		if (vlst->ifindex == ifindex) {
> > > +			vtep = vlst;
> > > +			break;
> > > +		}
> > > +	}
> > 
> > It is weird. You just can have vtep pointer in the dev_flow->tcf.tunnel instead
> > of ifindex_tun which is same as vtep->ifindex like the assertion below. Then,
> > this lookup can be skipped.
> 
> OK. Good optimization.
> 
> > 
> > > +	if (!vtep) {
> > > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > > +		goto exit;
> > > +	}
> > > +	switch (dev_flow->tcf.tunnel->type) {
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > +		break;
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > +/*
> > > + * TODO: Remove the encap ancillary rules first.
> > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);  */
> > 
> > Is it a personal note? Please remove.
> OK.
> 
> > 
> > > +		break;
> > > +	default:
> > > +		assert(false);
> > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > +		break;
> > > +	}
> > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > +	assert(vtep->refcnt);
> > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > +		LIST_REMOVE(vtep, next);
> > > +		flow_tcf_delete_iface(tcf, vtep);
> > > +	}
> > > +exit:
> > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > +}
> > > +
> > >  /**
> > >   * Apply flow to E-Switch by sending Netlink message.
> > >   *
> > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > >  	       struct rte_flow_error *error)  {
> > >  	struct priv *priv = dev->data->dev_private;
> > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > >  	struct mlx5_flow *dev_flow;
> > >  	struct nlmsghdr *nlh;
> > > +	int ret;
> > >
> > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > >  	/* E-Switch flow can't be expanded. */
> > >  	assert(!LIST_NEXT(dev_flow, next));
> > > +	if (dev_flow->tcf.applied)
> > > +		return 0;
> > >  	nlh = dev_flow->tcf.nlh;
> > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> > NLM_F_EXCL;
> > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > +	if (dev_flow->tcf.tunnel) {
> > > +		/*
> > > +		 * Replace the interface index, target for
> > > +		 * encapsulation, source for decapsulation.
> > > +		 */
> > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > +		/* Create actual VTEP device when rule is being applied. */
> > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > > +					dev_flow, error);
> > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > +			return -rte_errno;
> > > +		dev_flow->tcf.tunnel->ifindex_org
> > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > +	}
> > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +	if (dev_flow->tcf.tunnel) {
> > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > 
> > ifindex_org looks a temporary storage in this code. And this kind of hassle
> > (replace/restore) is there because you took the ifindex from the netlink
> > message. Why don't you have just
> > 
> > struct mlx5_flow_tcf_tunnel_hdr {
> > 	uint32_t type; /**< Tunnel action type. */
> > 	unsigned int ifindex; /**< Original dst/src interface */
> > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> > */ };
> > 
> > and don't change ifindex?
> 
> I propose to use the local variable for ifindex_org and do not keep it
> in structure. *ifindex_ptr will keep.

Well, you still have to restore the ifindex whenever sending the nl msg. Most of
all, ifindex_ptr in nl msg isn't a right place to store the ifindex. It should
have vtep ifindex but it just temporarily keeps the device ifindex until vtep is
created/found.

Thanks,
Yongseok
  
Slava Ovsiienko Oct. 26, 2018, 9:35 a.m. UTC | #4
> -----Original Message-----
> From: Yongseok Koh
> Sent: Friday, October 26, 2018 9:26
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> management
> 
> On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > > -----Original Message-----
> > > From: Yongseok Koh
> > > Sent: Thursday, October 25, 2018 3:28
> > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > > management
> > >
> > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> > > > VXLAN interfaces are dynamically created for each local UDP port
> > > > of outer networks and then used as targets for TC "flower" filters
> > > > in order to perform encapsulation. These VXLAN interfaces are
> > > > system-wide, the only one device with given UDP port can exist in
> > > > the system (the attempt of creating another device with the same
> > > > UDP local port returns EEXIST), so PMD should support the shared
> > > > device instances database for PMD instances. These VXLAN
> > > > implicitly created devices are called VTEPs (Virtual Tunnel End Points).
> > > >
> > > > Creation of the VTEP occurs at the moment of rule applying. The
> > > > link is set up, root ingress qdisc is also initialized.
> > > >
> > > > Encapsulation VTEPs are created on per port basis, the single VTEP
> > > > is attached to the outer interface and is shared for all
> > > > encapsulation rules on this interface. The source UDP port is
> > > > automatically selected in range 30000-60000.
> > > >
> > > > For decapsulaton one VTEP is created per every unique UDP local
> > > > port to accept tunnel traffic. The name of created VTEP consists
> > > > of prefix "vmlx_" and the number of UDP port in decimal digits
> > > > without leading zeros (vmlx_4789). The VTEP can be preliminary
> > > > created in the system before the launching
> > > > application, it allows to share	UDP ports between primary
> > > > and secondary processes.
> > > >
> > > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > > ---
> > > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > > ++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > index d6840d5..efa9c3b 100644
> > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > > >  	return -err;
> > > >  }
> > > >
> > > > +/* VTEP device list is shared between PMD port instances. */
> > > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER(); static
> > > pthread_mutex_t
> > > > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > >
> > > What's the reason for choosing pthread_mutex instead of rte_*_lock?
> >
> > The sharing this database for secondary processes?
> 
> The static variable isn't shared with sec proc. But you can leave it as is.

Yes. The sharing just was assumed, not implemented yet.

> 
> > > > +
> > > > +/**
> > > > + * Deletes VTEP network device.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] vtep
> > > > + *   Object represinting the network device to delete. Memory
> > > > + *   allocated for this object is freed by routine.
> > > > + */
> > > > +static void
> > > > +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf,
> > > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > > +	struct nlmsghdr *nlh;
> > > > +	struct ifinfomsg *ifm;
> > > > +	alignas(struct nlmsghdr)
> > > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > > +	int ret;
> > > > +
> > > > +	assert(!vtep->refcnt);
> > > > +	if (vtep->created && vtep->ifindex) {
> > >
> > > First of all vtep->created seems of no use. It is introduced to
> > > select the error message in flow_tcf_create_iface(). I don't see any
> > > necessity to distinguish between 'vtep is allocated by rte_malloc()' and
> 'vtep is created in kernel'.
> >
> > created flag indicates the iface is created by our code.
> > The VXLAN decap devices must have the specified UDP port, we can not
> > create multiple VXLAN devices with the same UDP port - EEXIST is
> > returned. So, we have to share device. One option is create device
> > before DPDK application launch and use these pre-created devices.
> > Inthis case created flag is not set and VXLAN device is not reinitialized, and
> not deleted.
> 
> I can't see any code to use pre-created device (created even before dpdk app
> launch). Your code just tries to create 'vmlx_xxxx'. Even from your comment
> in [7/7] patch, PMD will cleanup any leftovers (existing vtep devices) on
> initialization. Your comment sounds conflicting and confusing.

There are two types of VXLAN devices:

- VXLAN decap, not attached to any ifouter. Provides the ingress UDP port,
 we try to share the devices of this type, because we may be asked for
 the specified UDP port. No device/rule cleanup and reinit needed.

- VXLAN encap, should be attached to ifouter to provide strict egress path,
no need to share - egress UDP port does not matter. And we need to cleanup ifouter,
remove other attached VXLAN devices and rules, because it is too hard to
co-exist with some pre-created setup.. 

> 
> > > And why do you need to check vtep->ifindex as well? If vtep is
> > > created in kernel and its ifindex isn't set, that should be an error
> > > which had to be hanled in flow_tcf_create_iface(). Such a vtep shouldn't
> exist.
> > Yes, if we did not get ifindex of device - vtep is not created, error returned.
> > We just can not operate w/o ifindex.
> 
> I know ifindex is needed but my question was checking vtep->ifindex here
> looked redundant/unnecessary. But as you agreed on having
> create/get/release_iface(), it doesn't matter much.

Yes. I agree, will refactor the code.

> 
> > > Also, the refcnt management is a bit strange. Please put an
> > > abstraction by adding create_iface(), get_iface() and
> > > release_iface(). In the get_ifce(),
> > > vtep->refcnt should be incremented. And in the release_iface(), it
> > > vtep->decrease the
> > OK. Good proposal. I'll refactor the code.
> >
> > > refcnt and if it reaches to zero, the iface can be removed.
> > > create_iface() will set the refcnt to 1. And if you refer to
> > > mlx5_hrxq_get(), it even does searching the list not by repeating the
> same lookup code here and there.
> > > That will make your code much simpler.
> > >
> > > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > +		ifm->ifi_index = vtep->ifindex;
> > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > +		if (ret)
> > > > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN
> > > "
> > > > +					 "encap/decap ifindex %u",
> > > > +					 ifm->ifi_index);
> > > > +	}
> > > > +	rte_free(vtep);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Creates VTEP network device.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] ifouter
> > > > + *   Outer interface to attach new-created VXLAN device
> > > > + *   If zero the VXLAN device will not be attached to any device.
> > > > + * @param[in] port
> > > > + *   UDP port of created VTEP device.
> > > > + * @param[out] error
> > > > + *   Perform verbose error reporting if not NULL.
> > > > + *
> > > > + * @return
> > > > + * Pointer to created device structure on success, NULL otherwise
> > > > + * and rte_errno is set.
> > > > + */
> > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > >
> > > Why negative(ifndef) first intead of positive(ifdef)?
> > Hm. Did I miss the rule. Positive #ifdef first? OK.
> 
> No concrete rule but if there's no specific reason, it would be better to start
> from ifdef.
> 
> > > > +static struct mlx5_flow_tcf_vtep* flow_tcf_create_iface(struct
> > > > +mlx5_flow_tcf_context *tcf __rte_unused,
> > > > +		      unsigned int ifouter __rte_unused,
> > > > +		      uint16_t port __rte_unused,
> > > > +		      struct rte_flow_error *error) {
> > > > +	rte_flow_error_set(error, ENOTSUP,
> > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > +			 "netlink: failed to create VTEP, "
> > > > +			 "VXLAN metadat is not supported by kernel");
> > >
> > > Typo.
> >
> > OK.  "metadata are not supported".
> > >
> > > > +	return NULL;
> > > > +}
> > > > +#else
> > > > +static struct mlx5_flow_tcf_vtep* flow_tcf_create_iface(struct
> > > > +mlx5_flow_tcf_context *tcf,
> > >
> > > How about adding 'vtep'? It sounds vague - creating a general interface.
> > > E.g., flow_tcf_create_vtep_iface()?
> >
> > OK.
> >
> > >
> > > > +		      unsigned int ifouter,
> > > > +		      uint16_t port, struct rte_flow_error *error) {
> > > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > > +	struct nlmsghdr *nlh;
> > > > +	struct ifinfomsg *ifm;
> > > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > > +	alignas(struct nlmsghdr)
> > > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > >
> > > Use a macro for '128'. Can't know the meaning.
> > OK. I think we should calculate the buffer size explicitly.
> >
> > >
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > > +		       SZ_NLATTR_NEST * 2 +
> > > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > > +	struct nlattr *na_info;
> > > > +	struct nlattr *na_vxlan;
> > > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > >
> > > Use rte_cpu_to_be_*() instead.
> >
> > Yes, I'll recheck the whole code for this issue.
> >
> > >
> > > > +	int ret;
> > > > +
> > > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > > +	if (!vtep) {
> > > > +		rte_flow_error_set
> > > > +			(error, ENOMEM,
> > > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > +			 NULL, "unadble to allocate memory for VTEP desc");
> > > > +		return NULL;
> > > > +	}
> > > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > > +			.refcnt = 0,
> > > > +			.port = port,
> > > > +			.created = 0,
> > > > +			.ifouter = 0,
> > > > +			.ifindex = 0,
> > > > +			.local = LIST_HEAD_INITIALIZER(),
> > > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > > +	};
> > > > +	memset(buf, 0, sizeof(buf));
> > > > +	nlh = mnl_nlmsg_put_header(buf);
> > > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> > > NLM_F_EXCL;
> > > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > +	ifm->ifi_family = AF_UNSPEC;
> > > > +	ifm->ifi_type = 0;
> > > > +	ifm->ifi_index = 0;
> > > > +	ifm->ifi_flags = IFF_UP;
> > > > +	ifm->ifi_change = 0xffffffff;
> > > > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX,
> > > port);
> > > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > > +	assert(na_info);
> > > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > > +	if (ifouter)
> > > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > > +	assert(na_vxlan);
> > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > > +	mnl_attr_nest_end(nlh, na_info);
> > > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > +	if (ret)
> > > > +		DRV_LOG(WARNING,
> > > > +			"netlink: VTEP %s create failure (%d)",
> > > > +			name, rte_errno);
> > > > +	else
> > > > +		vtep->created = 1;
> > >
> > > Flow of code here isn't smooth, thus could be error-prone. Most of
> > > all, I don't like ret has multiple meanings. ret should be return
> > > value but you are using it to store ifindex.
> > >
> > > > +	if (ret && ifouter)
> > > > +		ret = 0;
> > > > +	else
> > > > +		ret = if_nametoindex(name);
> > >
> > > If vtep isn't created and ifouter is set, then skip init below,
> > > which means, if
> >
> > ifouter is set for VXLAN encap devices. They should be attached to
> > ifouter and can not be shared. So, if ifouter I set - we do not use
> > the precreated/existing VXLAN devices. We have to create our own not
> shared device.
> 
> In your code (flow_tcf_encap_vtep_create()), it is shared by multiple flows.
> Do you mean it isn't shared between different outer ifaces? If so, that's for
> sure.
Sorry, I do not understand the question.
VXLAN encap device is attached to ifouter and shared by all flows with this
ifouter. No multiple VXLAN devices are attached to the same ifouter, only one.
VXLAN decap device has no attached ifouter, so it can not share it.

> 
> > > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > > But why do you want to try to call this API even if it failed to create vtep?
> > > Let's not make code flow convoluted even though it logically works.
> > > Let's make it straightforward.
> > >
> > > > +	if (ret) {
> > > > +		vtep->ifindex = ret;
> > > > +		vtep->ifouter = ifouter;
> > > > +		memset(buf, 0, sizeof(buf));
> > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > +		ifm->ifi_type = 0;
> > > > +		ifm->ifi_index = vtep->ifindex;
> > > > +		ifm->ifi_flags = IFF_UP;
> > > > +		ifm->ifi_change = IFF_UP;
> > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > +		if (ret) {
> > > > +			DRV_LOG(WARNING,
> > > > +				"netlink: VTEP %s set link up failure (%d)",
> > > > +				name, rte_errno);
> > > > +			rte_free(vtep);
> > > > +			rte_flow_error_set
> > > > +				(error, -errno,
> > > > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > NULL,
> > > > +				 "netlink: failed to set VTEP link up");
> > > > +			vtep = NULL;
> > > > +		} else {
> > > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > > > +			if (ret)
> > > > +				DRV_LOG(WARNING,
> > > > +				"VTEP %s init failure (%d)", name, rte_errno);
> > > > +		}
> > > > +	} else {
> > > > +		DRV_LOG(WARNING,
> > > > +			"VTEP %s failed to get index (%d)", name, errno);
> > > > +		rte_flow_error_set
> > > > +			(error, -errno,
> > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > > +			 ret = 1;
> > >
> > > If it fails to create a vtep above, it will print out two warning
> > > messages and one rte_flow_error message. And it even selects message
> > > to print between two?
> > > And there's another info msg at the end even in case of failure. Do
> > > you really want to do this even with manipulating ret to change code
> > > path?  Not a good practice.
> > >
> > > Usually, code path should be straightforward for sucessful path and
> > > for errors/failures, return immediately or use 'goto' if there's need for
> cleanup.
> > >
> > > Please refactor entire function.
> >
> > I think I'll split it in two ones - for attached and potentially shared ifaces.
> > >
> > > > +	}
> > > > +	if (ret) {
> > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > +		vtep = NULL;
> > > > +	}
> > > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" :
> > > "error");
> > > > +	return vtep;
> > > > +}
> > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > > +
> > > > +/**
> > > > + * Create target interface index for VXLAN tunneling decapsulation.
> > > > + * In order to share the UDP port within the other interfaces the
> > > > + * VXLAN device created as not attached to any interface (if created).
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] dev_flow
> > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > + * @param[out] error
> > > > + *   Perform verbose error reporting if not NULL.
> > > > + * @return
> > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > >
> > > Return negative errno in case of failure like others.
> >
> > Anyway, we have to return an index. If we do not return it as function
> > result we will need to provide some extra pointing parameter, it
> complicates the code.
> 
> You misunderstood it. See what I wrote below. The function still returns the
> index but in case of error, make it return negative errno instead of zero.
> 
> > >
> > >  *   Interface index on success, a negative errno value otherwise and
> > > rte_errno is set.
> > >
> > > > + */
> > > > +static unsigned int
> > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > +			   struct mlx5_flow *dev_flow,
> > > > +			   struct rte_flow_error *error) {
> > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > +
> > > > +	vtep = NULL;
> > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > +		if (vlst->port == port) {
> > > > +			vtep = vlst;
> > > > +			break;
> > > > +		}
> > > > +	}
> > >
> > > You just need one variable.
> >
> > Yes. There is a long story, I forgot to revert code to one variable after
> debugging.
> > >
> > > 	struct mlx5_flow_tcf_vtep *vtep;
> > >
> > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > 		if (vtep->port == port)
> > > 			break;
> > > 	}
> > >
> > > > +	if (!vtep) {
> > > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > +		if (vtep)
> > > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > +	} else {
> > > > +		if (vtep->ifouter) {
> > > > +			rte_flow_error_set(error, -errno,
> > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > NULL,
> > > > +				"Failed to create decap VTEP, attached "
> > > > +				"device with the same UDP port exists");
> > > > +				vtep = NULL;
> > >
> > > Making vtep null to skip the following code?
> >
> > Yes. To avoid multiple return operators in code.
> 
> It's okay to have multiple returns. Why not?

It is easy to miss the return in the midst of function  while refactoring/modifying the code.

> 
> > > Please merge the two same
> > > if/else and make the code path strightforward. And which errno do
> > > you expect here?
> > > Should it be set EEXIST instead?
> > Not always. Netlink returns the code.
> 
> No, that's not my point. Your code above sets errno instead of rte_errno or
> EEXIST.
> 
> 	} else {
> 		if (vtep->ifouter) {
> 			rte_flow_error_set(error, -errno,
> 
> Which one sets this errno? Here, it sets rte_errno because matched vtep
libmnl sets, while processing the Netlink reply message (callback.c of libmnl sources).

> can't be used as it already has outer iface attached (error message isn't clear,
> please reword it too). I thought this should be EEXIST but you set errno to
> rte_errno but errno isn't valid at this point.
> 
> >
> > >
> > > > +		}
> > > > +	}
> > > > +	if (vtep) {
> > > > +		vtep->refcnt++;
> > > > +		assert(vtep->ifindex);
> > > > +		return vtep->ifindex;
> > > > +	} else {
> > > > +		return 0;
> > > > +	}
> > >
> > > Why repeating same if/else?
> > >
> > >
> > > This is my suggestion but if you take my suggestion to have
> > > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> > Agree.
> >
> > >
> > > {
> > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > >
> > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > 		if (vtep->port == port)
> > > 			break;
> > > 	}
> > > 	if (vtep && vtep->ifouter)
> > > 		return rte_flow_error_set(... EEXIST ...);
> > > 	else if (vtep) {
> > > 		++vtep->refcnt;
> > > 	} else {
> > > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > 		if (!vtep)
> > > 			return rte_flow_error_set(...);
> > > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > 	}
> > > 	assert(vtep->ifindex);
> > > 	return vtep->ifindex;
> > > }
> > >
> > >
> > > > +}
> > > > +
> > > > +/**
> > > > + * Creates target interface index for VXLAN tunneling encapsulation.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] ifouter
> > > > + *   Network interface index to attach VXLAN encap device to.
> > > > + * @param[in] dev_flow
> > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > + * @param[out] error
> > > > + *   Perform verbose error reporting if not NULL.
> > > > + * @return
> > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > + */
> > > > +static unsigned int
> > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > +			    unsigned int ifouter,
> > > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > > +			    struct rte_flow_error *error) {
> > > > +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > +
> > > > +	assert(ifouter);
> > > > +	/* Look whether the attached VTEP for encap is created. */
> > > > +	vtep = NULL;
> > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > +		if (vlst->ifouter == ifouter) {
> > > > +			vtep = vlst;
> > > > +			break;
> > > > +		}
> > > > +	}
> > >
> > > Same here.
> > >
> > > > +	if (!vtep) {
> > > > +		uint16_t pcnt;
> > > > +
> > > > +		/* Not found, we should create the new attached VTEP. */
> > > > +/*
> > > > + * TODO: not implemented yet
> > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > >
> > > Personal note is not appropriate even though it is removed in the
> > > following patch.
> > >
> > > > +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> > > > +				     - MLX5_VXLAN_PORT_RANGE_MIN);
> > > pcnt++) {
> > > > +			encap_port++;
> > > > +			/* Wraparound the UDP port index. */
> > > > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN
> > > ||
> > > > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > > > +				encap_port =
> > > MLX5_VXLAN_PORT_RANGE_MIN;
> > > > +			/* Check whether UDP port is in already in use. */
> > > > +			vtep = NULL;
> > > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > +				if (vlst->port == encap_port) {
> > > > +					vtep = vlst;
> > > > +					break;
> > > > +				}
> > > > +			}
> > >
> > > If you want to find out an empty port number, you can use rte_bitmap
> > > instead of repeating searching the entire list for all possible port
> numbers.
> >
> > We do not expect too many VXLAN devices have been created. bitmap.
> 
> +1, valid point.
> 
> > > > +			if (vtep) {
> > > > +				vtep = NULL;
> > > > +				continue;
> > > > +			}
> > > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > > +						     encap_port, error);
> > > > +			if (vtep) {
> > > > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> > > next);
> > > > +				break;
> > > > +			}
> > > > +			if (rte_errno != EEXIST)
> > > > +				break;
> > > > +		}
> > > > +	}
> > > > +	if (!vtep)
> > > > +		return 0;
> > > > +	vtep->refcnt++;
> > > > +	assert(vtep->ifindex);
> > > > +	return vtep->ifindex;
> > >
> > > Please refactor this func according to what I suggested for
> > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > >
> > > > +}
> > > > +
> > > > +/**
> > > > + * Creates target interface index for tunneling of any type.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] ifouter
> > > > + *   Network interface index to attach VXLAN encap device to.
> > > > + * @param[in] dev_flow
> > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > + * @param[out] error
> > > > + *   Perform verbose error reporting if not NULL.
> > > > + * @return
> > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > >
> > >  *   Interface index on success, a negative errno value otherwise and
> > >  *   rte_errno is set.
> > >
> > > > + */
> > > > +static unsigned int
> > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > +			    unsigned int ifouter,
> > > > +			    struct mlx5_flow *dev_flow,
> > > > +			    struct rte_flow_error *error) {
> > > > +	unsigned int ret;
> > > > +
> > > > +	assert(dev_flow->tcf.tunnel);
> > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > > +						 dev_flow, error);
> > > > +		break;
> > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > > > +		break;
> > > > +	default:
> > > > +		rte_flow_error_set(error, ENOTSUP,
> > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > NULL,
> > > > +				"unsupported tunnel type");
> > > > +		ret = 0;
> > > > +		break;
> > > > +	}
> > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Deletes tunneling interface by UDP port.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] ifindex
> > > > + *   Network interface index of VXLAN device.
> > > > + * @param[in] dev_flow
> > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > + */
> > > > +static void
> > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > > +			    unsigned int ifindex,
> > > > +			    struct mlx5_flow *dev_flow) {
> > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > +
> > > > +	assert(dev_flow->tcf.tunnel);
> > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > +	vtep = NULL;
> > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > +		if (vlst->ifindex == ifindex) {
> > > > +			vtep = vlst;
> > > > +			break;
> > > > +		}
> > > > +	}
> > >
> > > It is weird. You just can have vtep pointer in the
> > > dev_flow->tcf.tunnel instead of ifindex_tun which is same as
> > > vtep->ifindex like the assertion below. Then, this lookup can be skipped.
> >
> > OK. Good optimization.
> >
> > >
> > > > +	if (!vtep) {
> > > > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > > > +		goto exit;
> > > > +	}
> > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > +		break;
> > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > +/*
> > > > + * TODO: Remove the encap ancillary rules first.
> > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);  */
> > >
> > > Is it a personal note? Please remove.
> > OK.
> >
> > >
> > > > +		break;
> > > > +	default:
> > > > +		assert(false);
> > > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > > +		break;
> > > > +	}
> > > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > > +	assert(vtep->refcnt);
> > > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > > +		LIST_REMOVE(vtep, next);
> > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > +	}
> > > > +exit:
> > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > +}
> > > > +
> > > >  /**
> > > >   * Apply flow to E-Switch by sending Netlink message.
> > > >   *
> > > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > > >  	       struct rte_flow_error *error)  {
> > > >  	struct priv *priv = dev->data->dev_private;
> > > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > > >  	struct mlx5_flow *dev_flow;
> > > >  	struct nlmsghdr *nlh;
> > > > +	int ret;
> > > >
> > > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > > >  	/* E-Switch flow can't be expanded. */
> > > >  	assert(!LIST_NEXT(dev_flow, next));
> > > > +	if (dev_flow->tcf.applied)
> > > > +		return 0;
> > > >  	nlh = dev_flow->tcf.nlh;
> > > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> > > NLM_F_EXCL;
> > > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > > +	if (dev_flow->tcf.tunnel) {
> > > > +		/*
> > > > +		 * Replace the interface index, target for
> > > > +		 * encapsulation, source for decapsulation.
> > > > +		 */
> > > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > > +		/* Create actual VTEP device when rule is being applied. */
> > > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > > > +					dev_flow, error);
> > > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > > +			return -rte_errno;
> > > > +		dev_flow->tcf.tunnel->ifindex_org
> > > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > > +	}
> > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > +	if (dev_flow->tcf.tunnel) {
> > > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > >
> > > ifindex_org looks a temporary storage in this code. And this kind of
> > > hassle
> > > (replace/restore) is there because you took the ifindex from the
> > > netlink message. Why don't you have just
> > >
> > > struct mlx5_flow_tcf_tunnel_hdr {
> > > 	uint32_t type; /**< Tunnel action type. */
> > > 	unsigned int ifindex; /**< Original dst/src interface */
> > > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> > > */ };
> > >
> > > and don't change ifindex?
> >
> > I propose to use the local variable for ifindex_org and do not keep it
> > in structure. *ifindex_ptr will keep.
> 
> Well, you still have to restore the ifindex whenever sending the nl msg. Most
> of all, ifindex_ptr in nl msg isn't a right place to store the ifindex. 
It is stored there for rules w/o tunnels. It is its "native" place, Id prefer
not to create some new location to store the original index and save some space.
We have to swap indices only if rule has requested the tunneling.  We can not
set tunnel index permanently, because rule can be applied/removed/reapplied
and other new VXLAN device with new index can be recreated.

> have vtep ifindex but it just temporarily keeps the device ifindex until vtep is
> created/found.
> 
> Thanks,
> Yongseok

With best regards,
Slava
  
Yongseok Koh Oct. 26, 2018, 10:42 p.m. UTC | #5
On Fri, Oct 26, 2018 at 02:35:24AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Friday, October 26, 2018 9:26
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > management
> > 
> > On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > > > -----Original Message-----
> > > > From: Yongseok Koh
> > > > Sent: Thursday, October 25, 2018 3:28
> > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > > > management
> > > >
> > > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> > > > > VXLAN interfaces are dynamically created for each local UDP port
> > > > > of outer networks and then used as targets for TC "flower" filters
> > > > > in order to perform encapsulation. These VXLAN interfaces are
> > > > > system-wide, the only one device with given UDP port can exist in
> > > > > the system (the attempt of creating another device with the same
> > > > > UDP local port returns EEXIST), so PMD should support the shared
> > > > > device instances database for PMD instances. These VXLAN
> > > > > implicitly created devices are called VTEPs (Virtual Tunnel End Points).
> > > > >
> > > > > Creation of the VTEP occurs at the moment of rule applying. The
> > > > > link is set up, root ingress qdisc is also initialized.
> > > > >
> > > > > Encapsulation VTEPs are created on per port basis, the single VTEP
> > > > > is attached to the outer interface and is shared for all
> > > > > encapsulation rules on this interface. The source UDP port is
> > > > > automatically selected in range 30000-60000.
> > > > >
> > > > > For decapsulaton one VTEP is created per every unique UDP local
> > > > > port to accept tunnel traffic. The name of created VTEP consists
> > > > > of prefix "vmlx_" and the number of UDP port in decimal digits
> > > > > without leading zeros (vmlx_4789). The VTEP can be preliminary
> > > > > created in the system before the launching
> > > > > application, it allows to share	UDP ports between primary
> > > > > and secondary processes.
> > > > >
> > > > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > > > ---
> > > > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > > > ++++++++++++++++++++++++++++++++++++++-
> > > > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > index d6840d5..efa9c3b 100644
> > > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > > > >  	return -err;
> > > > >  }
> > > > >
> > > > > +/* VTEP device list is shared between PMD port instances. */
> > > > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER(); static
> > > > pthread_mutex_t
> > > > > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > > >
> > > > What's the reason for choosing pthread_mutex instead of rte_*_lock?
> > >
> > > The sharing this database for secondary processes?
> > 
> > The static variable isn't shared with sec proc. But you can leave it as is.
> 
> Yes. The sharing just was assumed, not implemented yet.
> 
> > 
> > > > > +
> > > > > +/**
> > > > > + * Deletes VTEP network device.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] vtep
> > > > > + *   Object represinting the network device to delete. Memory
> > > > > + *   allocated for this object is freed by routine.
> > > > > + */
> > > > > +static void
> > > > > +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf,
> > > > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > > > +	struct nlmsghdr *nlh;
> > > > > +	struct ifinfomsg *ifm;
> > > > > +	alignas(struct nlmsghdr)
> > > > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > > > +	int ret;
> > > > > +
> > > > > +	assert(!vtep->refcnt);
> > > > > +	if (vtep->created && vtep->ifindex) {
> > > >
> > > > First of all vtep->created seems of no use. It is introduced to
> > > > select the error message in flow_tcf_create_iface(). I don't see any
> > > > necessity to distinguish between 'vtep is allocated by rte_malloc()' and
> > 'vtep is created in kernel'.
> > >
> > > created flag indicates the iface is created by our code.
> > > The VXLAN decap devices must have the specified UDP port, we can not
> > > create multiple VXLAN devices with the same UDP port - EEXIST is
> > > returned. So, we have to share device. One option is create device
> > > before DPDK application launch and use these pre-created devices.
> > > Inthis case created flag is not set and VXLAN device is not reinitialized, and
> > not deleted.
> > 
> > I can't see any code to use pre-created device (created even before dpdk app
> > launch). Your code just tries to create 'vmlx_xxxx'. Even from your comment
> > in [7/7] patch, PMD will cleanup any leftovers (existing vtep devices) on
> > initialization. Your comment sounds conflicting and confusing.
> 
> There are two types of VXLAN devices:
> 
> - VXLAN decap, not attached to any ifouter. Provides the ingress UDP port,
>  we try to share the devices of this type, because we may be asked for
>  the specified UDP port. No device/rule cleanup and reinit needed.
> 
> - VXLAN encap, should be attached to ifouter to provide strict egress path,
> no need to share - egress UDP port does not matter. And we need to cleanup ifouter,
> remove other attached VXLAN devices and rules, because it is too hard to
> co-exist with some pre-created setup.. 

I knew that. But how can it justify the need of 'created' field in vtep struct?
In this code, it is of no use. But will see how it is used in your v3.

> > > > And why do you need to check vtep->ifindex as well? If vtep is
> > > > created in kernel and its ifindex isn't set, that should be an error
> > > > which had to be hanled in flow_tcf_create_iface(). Such a vtep shouldn't
> > exist.
> > > Yes, if we did not get ifindex of device - vtep is not created, error returned.
> > > We just can not operate w/o ifindex.
> > 
> > I know ifindex is needed but my question was checking vtep->ifindex here
> > looked redundant/unnecessary. But as you agreed on having
> > create/get/release_iface(), it doesn't matter much.
> 
> Yes. I agree, will refactor the code.
> 
> > 
> > > > Also, the refcnt management is a bit strange. Please put an
> > > > abstraction by adding create_iface(), get_iface() and
> > > > release_iface(). In the get_ifce(),
> > > > vtep->refcnt should be incremented. And in the release_iface(), it
> > > > vtep->decrease the
> > > OK. Good proposal. I'll refactor the code.
> > >
> > > > refcnt and if it reaches to zero, the iface can be removed.
> > > > create_iface() will set the refcnt to 1. And if you refer to
> > > > mlx5_hrxq_get(), it even does searching the list not by repeating the
> > same lookup code here and there.
> > > > That will make your code much simpler.
> > > >
> > > > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > +		if (ret)
> > > > > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN
> > > > "
> > > > > +					 "encap/decap ifindex %u",
> > > > > +					 ifm->ifi_index);
> > > > > +	}
> > > > > +	rte_free(vtep);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Creates VTEP network device.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] ifouter
> > > > > + *   Outer interface to attach new-created VXLAN device
> > > > > + *   If zero the VXLAN device will not be attached to any device.
> > > > > + * @param[in] port
> > > > > + *   UDP port of created VTEP device.
> > > > > + * @param[out] error
> > > > > + *   Perform verbose error reporting if not NULL.
> > > > > + *
> > > > > + * @return
> > > > > + * Pointer to created device structure on success, NULL otherwise
> > > > > + * and rte_errno is set.
> > > > > + */
> > > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > > >
> > > > Why negative(ifndef) first intead of positive(ifdef)?
> > > Hm. Did I miss the rule. Positive #ifdef first? OK.
> > 
> > No concrete rule but if there's no specific reason, it would be better to start
> > from ifdef.
> > 
> > > > > +static struct mlx5_flow_tcf_vtep* flow_tcf_create_iface(struct
> > > > > +mlx5_flow_tcf_context *tcf __rte_unused,
> > > > > +		      unsigned int ifouter __rte_unused,
> > > > > +		      uint16_t port __rte_unused,
> > > > > +		      struct rte_flow_error *error) {
> > > > > +	rte_flow_error_set(error, ENOTSUP,
> > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > > +			 "netlink: failed to create VTEP, "
> > > > > +			 "VXLAN metadat is not supported by kernel");
> > > >
> > > > Typo.
> > >
> > > OK.  "metadata are not supported".
> > > >
> > > > > +	return NULL;
> > > > > +}
> > > > > +#else
> > > > > +static struct mlx5_flow_tcf_vtep* flow_tcf_create_iface(struct
> > > > > +mlx5_flow_tcf_context *tcf,
> > > >
> > > > How about adding 'vtep'? It sounds vague - creating a general interface.
> > > > E.g., flow_tcf_create_vtep_iface()?
> > >
> > > OK.
> > >
> > > >
> > > > > +		      unsigned int ifouter,
> > > > > +		      uint16_t port, struct rte_flow_error *error) {
> > > > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > > > +	struct nlmsghdr *nlh;
> > > > > +	struct ifinfomsg *ifm;
> > > > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > > > +	alignas(struct nlmsghdr)
> > > > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > > >
> > > > Use a macro for '128'. Can't know the meaning.
> > > OK. I think we should calculate the buffer size explicitly.
> > >
> > > >
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > > > +		       SZ_NLATTR_NEST * 2 +
> > > > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > > > +	struct nlattr *na_info;
> > > > > +	struct nlattr *na_vxlan;
> > > > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > > >
> > > > Use rte_cpu_to_be_*() instead.
> > >
> > > Yes, I'll recheck the whole code for this issue.
> > >
> > > >
> > > > > +	int ret;
> > > > > +
> > > > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > > > +	if (!vtep) {
> > > > > +		rte_flow_error_set
> > > > > +			(error, ENOMEM,
> > > > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > +			 NULL, "unadble to allocate memory for VTEP desc");
> > > > > +		return NULL;
> > > > > +	}
> > > > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > > > +			.refcnt = 0,
> > > > > +			.port = port,
> > > > > +			.created = 0,
> > > > > +			.ifouter = 0,
> > > > > +			.ifindex = 0,
> > > > > +			.local = LIST_HEAD_INITIALIZER(),
> > > > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > > > +	};
> > > > > +	memset(buf, 0, sizeof(buf));
> > > > > +	nlh = mnl_nlmsg_put_header(buf);
> > > > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> > > > NLM_F_EXCL;
> > > > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > +	ifm->ifi_family = AF_UNSPEC;
> > > > > +	ifm->ifi_type = 0;
> > > > > +	ifm->ifi_index = 0;
> > > > > +	ifm->ifi_flags = IFF_UP;
> > > > > +	ifm->ifi_change = 0xffffffff;
> > > > > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX,
> > > > port);
> > > > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > > > +	assert(na_info);
> > > > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > > > +	if (ifouter)
> > > > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > > > +	assert(na_vxlan);
> > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > > > +	mnl_attr_nest_end(nlh, na_info);
> > > > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > +	if (ret)
> > > > > +		DRV_LOG(WARNING,
> > > > > +			"netlink: VTEP %s create failure (%d)",
> > > > > +			name, rte_errno);
> > > > > +	else
> > > > > +		vtep->created = 1;
> > > >
> > > > Flow of code here isn't smooth, thus could be error-prone. Most of
> > > > all, I don't like ret has multiple meanings. ret should be return
> > > > value but you are using it to store ifindex.
> > > >
> > > > > +	if (ret && ifouter)
> > > > > +		ret = 0;
> > > > > +	else
> > > > > +		ret = if_nametoindex(name);
> > > >
> > > > If vtep isn't created and ifouter is set, then skip init below,
> > > > which means, if
> > >
> > > ifouter is set for VXLAN encap devices. They should be attached to
> > > ifouter and can not be shared. So, if ifouter I set - we do not use
> > > the precreated/existing VXLAN devices. We have to create our own not
> > shared device.
> > 
> > In your code (flow_tcf_encap_vtep_create()), it is shared by multiple flows.
> > Do you mean it isn't shared between different outer ifaces? If so, that's for
> > sure.
> Sorry, I do not understand the question.
> VXLAN encap device is attached to ifouter and shared by all flows with this
> ifouter. No multiple VXLAN devices are attached to the same ifouter, only one.
> VXLAN decap device has no attached ifouter, so it can not share it.

Yep, that's what I meant.

> > > > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > > > But why do you want to try to call this API even if it failed to create vtep?
> > > > Let's not make code flow convoluted even though it logically works.
> > > > Let's make it straightforward.
> > > >
> > > > > +	if (ret) {
> > > > > +		vtep->ifindex = ret;
> > > > > +		vtep->ifouter = ifouter;
> > > > > +		memset(buf, 0, sizeof(buf));
> > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > +		ifm->ifi_type = 0;
> > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > +		ifm->ifi_flags = IFF_UP;
> > > > > +		ifm->ifi_change = IFF_UP;
> > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > +		if (ret) {
> > > > > +			DRV_LOG(WARNING,
> > > > > +				"netlink: VTEP %s set link up failure (%d)",
> > > > > +				name, rte_errno);
> > > > > +			rte_free(vtep);
> > > > > +			rte_flow_error_set
> > > > > +				(error, -errno,
> > > > > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > NULL,
> > > > > +				 "netlink: failed to set VTEP link up");
> > > > > +			vtep = NULL;
> > > > > +		} else {
> > > > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > > > > +			if (ret)
> > > > > +				DRV_LOG(WARNING,
> > > > > +				"VTEP %s init failure (%d)", name, rte_errno);
> > > > > +		}
> > > > > +	} else {
> > > > > +		DRV_LOG(WARNING,
> > > > > +			"VTEP %s failed to get index (%d)", name, errno);
> > > > > +		rte_flow_error_set
> > > > > +			(error, -errno,
> > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > > > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > > > +			 ret = 1;
> > > >
> > > > If it fails to create a vtep above, it will print out two warning
> > > > messages and one rte_flow_error message. And it even selects message
> > > > to print between two?
> > > > And there's another info msg at the end even in case of failure. Do
> > > > you really want to do this even with manipulating ret to change code
> > > > path?  Not a good practice.
> > > >
> > > > Usually, code path should be straightforward for sucessful path and
> > > > for errors/failures, return immediately or use 'goto' if there's need for
> > cleanup.
> > > >
> > > > Please refactor entire function.
> > >
> > > I think I'll split it in two ones - for attached and potentially shared ifaces.
> > > >
> > > > > +	}
> > > > > +	if (ret) {
> > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > +		vtep = NULL;
> > > > > +	}
> > > > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" :
> > > > "error");
> > > > > +	return vtep;
> > > > > +}
> > > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > > > +
> > > > > +/**
> > > > > + * Create target interface index for VXLAN tunneling decapsulation.
> > > > > + * In order to share the UDP port within the other interfaces the
> > > > > + * VXLAN device created as not attached to any interface (if created).
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] dev_flow
> > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > + * @param[out] error
> > > > > + *   Perform verbose error reporting if not NULL.
> > > > > + * @return
> > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > >
> > > > Return negative errno in case of failure like others.
> > >
> > > Anyway, we have to return an index. If we do not return it as function
> > > result we will need to provide some extra pointing parameter, it
> > complicates the code.
> > 
> > You misunderstood it. See what I wrote below. The function still returns the
> > index but in case of error, make it return negative errno instead of zero.
> > 
> > > >
> > > >  *   Interface index on success, a negative errno value otherwise and
> > > > rte_errno is set.
> > > >
> > > > > + */
> > > > > +static unsigned int
> > > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > +			   struct mlx5_flow *dev_flow,
> > > > > +			   struct rte_flow_error *error) {
> > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > > +
> > > > > +	vtep = NULL;
> > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > +		if (vlst->port == port) {
> > > > > +			vtep = vlst;
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > >
> > > > You just need one variable.
> > >
> > > Yes. There is a long story, I forgot to revert code to one variable after
> > debugging.
> > > >
> > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > >
> > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > 		if (vtep->port == port)
> > > > 			break;
> > > > 	}
> > > >
> > > > > +	if (!vtep) {
> > > > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > +		if (vtep)
> > > > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > > +	} else {
> > > > > +		if (vtep->ifouter) {
> > > > > +			rte_flow_error_set(error, -errno,
> > > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > NULL,
> > > > > +				"Failed to create decap VTEP, attached "
> > > > > +				"device with the same UDP port exists");
> > > > > +				vtep = NULL;
> > > >
> > > > Making vtep null to skip the following code?
> > >
> > > Yes. To avoid multiple return operators in code.
> > 
> > It's okay to have multiple returns. Why not?
> 
> It is easy to miss the return in the midst of function  while refactoring/modifying the code.

Your code path doesn't look easy and free from error. Please refer to other
control path functions in this PMD.

> > > > Please merge the two same
> > > > if/else and make the code path strightforward. And which errno do
> > > > you expect here?
> > > > Should it be set EEXIST instead?
> > > Not always. Netlink returns the code.
> > 
> > No, that's not my point. Your code above sets errno instead of rte_errno or
> > EEXIST.
> > 
> > 	} else {
> > 		if (vtep->ifouter) {
> > 			rte_flow_error_set(error, -errno,
> > 
> > Which one sets this errno? Here, it sets rte_errno because matched vtep
> libmnl sets, while processing the Netlink reply message (callback.c of libmnl sources).

You still don't understand my point.

In this flow_tcf_decap_vtep_create(), if vtep is found (vtep != NULL), how can
errno be set? Before the if/else, there's no libmnl call.

> > can't be used as it already has outer iface attached (error message isn't clear,
> > please reword it too). I thought this should be EEXIST but you set errno to
> > rte_errno but errno isn't valid at this point.
> > 
> > >
> > > >
> > > > > +		}
> > > > > +	}
> > > > > +	if (vtep) {
> > > > > +		vtep->refcnt++;
> > > > > +		assert(vtep->ifindex);
> > > > > +		return vtep->ifindex;
> > > > > +	} else {
> > > > > +		return 0;
> > > > > +	}
> > > >
> > > > Why repeating same if/else?
> > > >
> > > >
> > > > This is my suggestion but if you take my suggestion to have
> > > > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> > > Agree.
> > >
> > > >
> > > > {
> > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > >
> > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > 		if (vtep->port == port)
> > > > 			break;
> > > > 	}
> > > > 	if (vtep && vtep->ifouter)
> > > > 		return rte_flow_error_set(... EEXIST ...);
> > > > 	else if (vtep) {
> > > > 		++vtep->refcnt;
> > > > 	} else {
> > > > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > 		if (!vtep)
> > > > 			return rte_flow_error_set(...);
> > > > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > 	}
> > > > 	assert(vtep->ifindex);
> > > > 	return vtep->ifindex;
> > > > }
> > > >
> > > >
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Creates target interface index for VXLAN tunneling encapsulation.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] ifouter
> > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > + * @param[in] dev_flow
> > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > + * @param[out] error
> > > > > + *   Perform verbose error reporting if not NULL.
> > > > > + * @return
> > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > + */
> > > > > +static unsigned int
> > > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > +			    unsigned int ifouter,
> > > > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > > > +			    struct rte_flow_error *error) {
> > > > > +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > +
> > > > > +	assert(ifouter);
> > > > > +	/* Look whether the attached VTEP for encap is created. */
> > > > > +	vtep = NULL;
> > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > +		if (vlst->ifouter == ifouter) {
> > > > > +			vtep = vlst;
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > >
> > > > Same here.
> > > >
> > > > > +	if (!vtep) {
> > > > > +		uint16_t pcnt;
> > > > > +
> > > > > +		/* Not found, we should create the new attached VTEP. */
> > > > > +/*
> > > > > + * TODO: not implemented yet
> > > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > > >
> > > > Personal note is not appropriate even though it is removed in the
> > > > following patch.
> > > >
> > > > > +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> > > > > +				     - MLX5_VXLAN_PORT_RANGE_MIN);
> > > > pcnt++) {
> > > > > +			encap_port++;
> > > > > +			/* Wraparound the UDP port index. */
> > > > > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN
> > > > ||
> > > > > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > > > > +				encap_port =
> > > > MLX5_VXLAN_PORT_RANGE_MIN;
> > > > > +			/* Check whether UDP port is in already in use. */
> > > > > +			vtep = NULL;
> > > > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > +				if (vlst->port == encap_port) {
> > > > > +					vtep = vlst;
> > > > > +					break;
> > > > > +				}
> > > > > +			}
> > > >
> > > > If you want to find out an empty port number, you can use rte_bitmap
> > > > instead of repeating searching the entire list for all possible port
> > numbers.
> > >
> > > We do not expect too many VXLAN devices have been created. bitmap.
> > 
> > +1, valid point.
> > 
> > > > > +			if (vtep) {
> > > > > +				vtep = NULL;
> > > > > +				continue;
> > > > > +			}
> > > > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > > > +						     encap_port, error);
> > > > > +			if (vtep) {
> > > > > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> > > > next);
> > > > > +				break;
> > > > > +			}
> > > > > +			if (rte_errno != EEXIST)
> > > > > +				break;
> > > > > +		}
> > > > > +	}
> > > > > +	if (!vtep)
> > > > > +		return 0;
> > > > > +	vtep->refcnt++;
> > > > > +	assert(vtep->ifindex);
> > > > > +	return vtep->ifindex;
> > > >
> > > > Please refactor this func according to what I suggested for
> > > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > > >
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Creates target interface index for tunneling of any type.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] ifouter
> > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > + * @param[in] dev_flow
> > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > + * @param[out] error
> > > > > + *   Perform verbose error reporting if not NULL.
> > > > > + * @return
> > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > >
> > > >  *   Interface index on success, a negative errno value otherwise and
> > > >  *   rte_errno is set.
> > > >
> > > > > + */
> > > > > +static unsigned int
> > > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > +			    unsigned int ifouter,
> > > > > +			    struct mlx5_flow *dev_flow,
> > > > > +			    struct rte_flow_error *error) {
> > > > > +	unsigned int ret;
> > > > > +
> > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > > > +						 dev_flow, error);
> > > > > +		break;
> > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > > > > +		break;
> > > > > +	default:
> > > > > +		rte_flow_error_set(error, ENOTSUP,
> > > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > NULL,
> > > > > +				"unsupported tunnel type");
> > > > > +		ret = 0;
> > > > > +		break;
> > > > > +	}
> > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Deletes tunneling interface by UDP port.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] ifindex
> > > > > + *   Network interface index of VXLAN device.
> > > > > + * @param[in] dev_flow
> > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > + */
> > > > > +static void
> > > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > > > +			    unsigned int ifindex,
> > > > > +			    struct mlx5_flow *dev_flow) {
> > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > +
> > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > +	vtep = NULL;
> > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > +		if (vlst->ifindex == ifindex) {
> > > > > +			vtep = vlst;
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > >
> > > > It is weird. You just can have vtep pointer in the
> > > > dev_flow->tcf.tunnel instead of ifindex_tun which is same as
> > > > vtep->ifindex like the assertion below. Then, this lookup can be skipped.
> > >
> > > OK. Good optimization.
> > >
> > > >
> > > > > +	if (!vtep) {
> > > > > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > > > > +		goto exit;
> > > > > +	}
> > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > +		break;
> > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > +/*
> > > > > + * TODO: Remove the encap ancillary rules first.
> > > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);  */
> > > >
> > > > Is it a personal note? Please remove.
> > > OK.
> > >
> > > >
> > > > > +		break;
> > > > > +	default:
> > > > > +		assert(false);
> > > > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > > > +		break;
> > > > > +	}
> > > > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > > > +	assert(vtep->refcnt);
> > > > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > > > +		LIST_REMOVE(vtep, next);
> > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > +	}
> > > > > +exit:
> > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * Apply flow to E-Switch by sending Netlink message.
> > > > >   *
> > > > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > > > >  	       struct rte_flow_error *error)  {
> > > > >  	struct priv *priv = dev->data->dev_private;
> > > > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > > > >  	struct mlx5_flow *dev_flow;
> > > > >  	struct nlmsghdr *nlh;
> > > > > +	int ret;
> > > > >
> > > > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > > > >  	/* E-Switch flow can't be expanded. */
> > > > >  	assert(!LIST_NEXT(dev_flow, next));
> > > > > +	if (dev_flow->tcf.applied)
> > > > > +		return 0;
> > > > >  	nlh = dev_flow->tcf.nlh;
> > > > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > > > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> > > > NLM_F_EXCL;
> > > > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > +		/*
> > > > > +		 * Replace the interface index, target for
> > > > > +		 * encapsulation, source for decapsulation.
> > > > > +		 */
> > > > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > +		/* Create actual VTEP device when rule is being applied. */
> > > > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > > > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > > > > +					dev_flow, error);
> > > > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > > > +			return -rte_errno;
> > > > > +		dev_flow->tcf.tunnel->ifindex_org
> > > > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > > > +	}
> > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > > >
> > > > ifindex_org looks a temporary storage in this code. And this kind of
> > > > hassle
> > > > (replace/restore) is there because you took the ifindex from the
> > > > netlink message. Why don't you have just
> > > >
> > > > struct mlx5_flow_tcf_tunnel_hdr {
> > > > 	uint32_t type; /**< Tunnel action type. */
> > > > 	unsigned int ifindex; /**< Original dst/src interface */
> > > > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > > > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> > > > */ };
> > > >
> > > > and don't change ifindex?
> > >
> > > I propose to use the local variable for ifindex_org and do not keep it
> > > in structure. *ifindex_ptr will keep.
> > 
> > Well, you still have to restore the ifindex whenever sending the nl msg. Most
> > of all, ifindex_ptr in nl msg isn't a right place to store the ifindex. 
> It is stored there for rules w/o tunnels. It is its "native" place, Id prefer
> not to create some new location to store the original index and save some space.
> We have to swap indices only if rule has requested the tunneling.  We can not

No no. At this point, flow is already created to be tunneled one. What do you
mean by 'rules w/o tunnels' or 'only if rule has requested the tunneling'?? It
has already been created as a vxlan tunnel rule. It won't be changed. The nlmsg
is supposed to have vtep ifindex but translation didn't know it and stored the
outer iface temporarily to get it replaced by vtep ifindex. It never be a
'native'/'original' place to store it. In which case the nl msg can be sent with
the 'original' ifindex? Any specific example? No.

> set tunnel index permanently, because rule can be applied/removed/reapplied
> and other new VXLAN device with new index can be recreated.

Every time it is applied, it will get the vtep and overwrite vtep ifindex in the
nl msg.

> > have vtep ifindex but it just temporarily keeps the device ifindex until vtep is
> > created/found.

Thanks,
Yongseok
  
Slava Ovsiienko Oct. 29, 2018, 11:53 a.m. UTC | #6
> -----Original Message-----
> From: Yongseok Koh
> Sent: Saturday, October 27, 2018 1:43
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> management
> 
> On Fri, Oct 26, 2018 at 02:35:24AM -0700, Slava Ovsiienko wrote:
> > > -----Original Message-----
> > > From: Yongseok Koh
> > > Sent: Friday, October 26, 2018 9:26
> > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > > management
> > >
> > > On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > > > > -----Original Message-----
> > > > > From: Yongseok Koh
> > > > > Sent: Thursday, October 25, 2018 3:28
> > > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel
> > > > > devices management
> > > > >
> > > > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko
> wrote:
> > > > > > VXLAN interfaces are dynamically created for each local UDP
> > > > > > port of outer networks and then used as targets for TC
> > > > > > "flower" filters in order to perform encapsulation. These
> > > > > > VXLAN interfaces are system-wide, the only one device with
> > > > > > given UDP port can exist in the system (the attempt of
> > > > > > creating another device with the same UDP local port returns
> > > > > > EEXIST), so PMD should support the shared device instances
> > > > > > database for PMD instances. These VXLAN implicitly created devices
> are called VTEPs (Virtual Tunnel End Points).
> > > > > >
> > > > > > Creation of the VTEP occurs at the moment of rule applying.
> > > > > > The link is set up, root ingress qdisc is also initialized.
> > > > > >
> > > > > > Encapsulation VTEPs are created on per port basis, the single
> > > > > > VTEP is attached to the outer interface and is shared for all
> > > > > > encapsulation rules on this interface. The source UDP port is
> > > > > > automatically selected in range 30000-60000.
> > > > > >
> > > > > > For decapsulaton one VTEP is created per every unique UDP
> > > > > > local port to accept tunnel traffic. The name of created VTEP
> > > > > > consists of prefix "vmlx_" and the number of UDP port in
> > > > > > decimal digits without leading zeros (vmlx_4789). The VTEP can
> > > > > > be preliminary created in the system before the launching
> > > > > > application, it allows to share	UDP ports between primary
> > > > > > and secondary processes.
> > > > > >
> > > > > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > > > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > > > > ---
> > > > > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > > > > ++++++++++++++++++++++++++++++++++++++-
> > > > > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > index d6840d5..efa9c3b 100644
> > > > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > > > > >  	return -err;
> > > > > >  }
> > > > > >
> > > > > > +/* VTEP device list is shared between PMD port instances. */
> > > > > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > > > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> static
> > > > > pthread_mutex_t
> > > > > > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > > > >
> > > > > What's the reason for choosing pthread_mutex instead of
> rte_*_lock?
> > > >
> > > > The sharing this database for secondary processes?
> > >
> > > The static variable isn't shared with sec proc. But you can leave it as is.
> >
> > Yes. The sharing just was assumed, not implemented yet.
> >
> > >
> > > > > > +
> > > > > > +/**
> > > > > > + * Deletes VTEP network device.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] vtep
> > > > > > + *   Object represinting the network device to delete. Memory
> > > > > > + *   allocated for this object is freed by routine.
> > > > > > + */
> > > > > > +static void
> > > > > > +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf,
> > > > > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > > > > +	struct nlmsghdr *nlh;
> > > > > > +	struct ifinfomsg *ifm;
> > > > > > +	alignas(struct nlmsghdr)
> > > > > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	assert(!vtep->refcnt);
> > > > > > +	if (vtep->created && vtep->ifindex) {
> > > > >
> > > > > First of all vtep->created seems of no use. It is introduced to
> > > > > select the error message in flow_tcf_create_iface(). I don't see
> > > > > any necessity to distinguish between 'vtep is allocated by
> > > > > rte_malloc()' and
> > > 'vtep is created in kernel'.
> > > >
> > > > created flag indicates the iface is created by our code.
> > > > The VXLAN decap devices must have the specified UDP port, we can
> > > > not create multiple VXLAN devices with the same UDP port - EEXIST
> > > > is returned. So, we have to share device. One option is create
> > > > device before DPDK application launch and use these pre-created
> devices.
> > > > Inthis case created flag is not set and VXLAN device is not
> > > > reinitialized, and
> > > not deleted.
> > >
> > > I can't see any code to use pre-created device (created even before
> > > dpdk app launch). Your code just tries to create 'vmlx_xxxx'. Even
> > > from your comment in [7/7] patch, PMD will cleanup any leftovers
> > > (existing vtep devices) on initialization. Your comment sounds conflicting
> and confusing.
> >
> > There are two types of VXLAN devices:
> >
> > - VXLAN decap, not attached to any ifouter. Provides the ingress UDP
> > port,  we try to share the devices of this type, because we may be
> > asked for  the specified UDP port. No device/rule cleanup and reinit
> needed.
> >
> > - VXLAN encap, should be attached to ifouter to provide strict egress
> > path, no need to share - egress UDP port does not matter. And we need
> > to cleanup ifouter, remove other attached VXLAN devices and rules,
> > because it is too hard to co-exist with some pre-created setup..
> 
> I knew that. But how can it justify the need of 'created' field in vtep struct?
> In this code, it is of no use. But will see how it is used in your v3.
> 
> > > > > And why do you need to check vtep->ifindex as well? If vtep is
> > > > > created in kernel and its ifindex isn't set, that should be an
> > > > > error which had to be hanled in flow_tcf_create_iface(). Such a
> > > > > vtep shouldn't
> > > exist.
> > > > Yes, if we did not get ifindex of device - vtep is not created, error
> returned.
> > > > We just can not operate w/o ifindex.
> > >
> > > I know ifindex is needed but my question was checking vtep->ifindex
> > > here looked redundant/unnecessary. But as you agreed on having
> > > create/get/release_iface(), it doesn't matter much.
> >
> > Yes. I agree, will refactor the code.
> >
> > >
> > > > > Also, the refcnt management is a bit strange. Please put an
> > > > > abstraction by adding create_iface(), get_iface() and
> > > > > release_iface(). In the get_ifce(),
> > > > > vtep->refcnt should be incremented. And in the release_iface(),
> > > > > vtep->it decrease the
> > > > OK. Good proposal. I'll refactor the code.
> > > >
> > > > > refcnt and if it reaches to zero, the iface can be removed.
> > > > > create_iface() will set the refcnt to 1. And if you refer to
> > > > > mlx5_hrxq_get(), it even does searching the list not by
> > > > > repeating the
> > > same lookup code here and there.
> > > > > That will make your code much simpler.
> > > > >
> > > > > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh,
> sizeof(*ifm));
> > > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > +		if (ret)
> > > > > > +			DRV_LOG(WARNING, "netlink: error deleting
> VXLAN
> > > > > "
> > > > > > +					 "encap/decap ifindex %u",
> > > > > > +					 ifm->ifi_index);
> > > > > > +	}
> > > > > > +	rte_free(vtep);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Creates VTEP network device.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] ifouter
> > > > > > + *   Outer interface to attach new-created VXLAN device
> > > > > > + *   If zero the VXLAN device will not be attached to any device.
> > > > > > + * @param[in] port
> > > > > > + *   UDP port of created VTEP device.
> > > > > > + * @param[out] error
> > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > + *
> > > > > > + * @return
> > > > > > + * Pointer to created device structure on success, NULL
> > > > > > +otherwise
> > > > > > + * and rte_errno is set.
> > > > > > + */
> > > > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > > > >
> > > > > Why negative(ifndef) first intead of positive(ifdef)?
> > > > Hm. Did I miss the rule. Positive #ifdef first? OK.
> > >
> > > No concrete rule but if there's no specific reason, it would be
> > > better to start from ifdef.
> > >
> > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf
> __rte_unused,
> > > > > > +		      unsigned int ifouter __rte_unused,
> > > > > > +		      uint16_t port __rte_unused,
> > > > > > +		      struct rte_flow_error *error) {
> > > > > > +	rte_flow_error_set(error, ENOTSUP,
> > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> NULL,
> > > > > > +			 "netlink: failed to create VTEP, "
> > > > > > +			 "VXLAN metadat is not supported by
> kernel");
> > > > >
> > > > > Typo.
> > > >
> > > > OK.  "metadata are not supported".
> > > > >
> > > > > > +	return NULL;
> > > > > > +}
> > > > > > +#else
> > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> > > > >
> > > > > How about adding 'vtep'? It sounds vague - creating a general
> interface.
> > > > > E.g., flow_tcf_create_vtep_iface()?
> > > >
> > > > OK.
> > > >
> > > > >
> > > > > > +		      unsigned int ifouter,
> > > > > > +		      uint16_t port, struct rte_flow_error *error) {
> > > > > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > > > > +	struct nlmsghdr *nlh;
> > > > > > +	struct ifinfomsg *ifm;
> > > > > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > > > > +	alignas(struct nlmsghdr)
> > > > > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > > > >
> > > > > Use a macro for '128'. Can't know the meaning.
> > > > OK. I think we should calculate the buffer size explicitly.
> > > >
> > > > >
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > > > > +		       SZ_NLATTR_NEST * 2 +
> > > > > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > > > > +	struct nlattr *na_info;
> > > > > > +	struct nlattr *na_vxlan;
> > > > > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > > > >
> > > > > Use rte_cpu_to_be_*() instead.
> > > >
> > > > Yes, I'll recheck the whole code for this issue.
> > > >
> > > > >
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > > > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > > > > +	if (!vtep) {
> > > > > > +		rte_flow_error_set
> > > > > > +			(error, ENOMEM,
> > > > > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > +			 NULL, "unadble to allocate memory for
> VTEP desc");
> > > > > > +		return NULL;
> > > > > > +	}
> > > > > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > > > > +			.refcnt = 0,
> > > > > > +			.port = port,
> > > > > > +			.created = 0,
> > > > > > +			.ifouter = 0,
> > > > > > +			.ifindex = 0,
> > > > > > +			.local = LIST_HEAD_INITIALIZER(),
> > > > > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > > > > +	};
> > > > > > +	memset(buf, 0, sizeof(buf));
> > > > > > +	nlh = mnl_nlmsg_put_header(buf);
> > > > > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > > > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> > > > > NLM_F_EXCL;
> > > > > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > > +	ifm->ifi_family = AF_UNSPEC;
> > > > > > +	ifm->ifi_type = 0;
> > > > > > +	ifm->ifi_index = 0;
> > > > > > +	ifm->ifi_flags = IFF_UP;
> > > > > > +	ifm->ifi_change = 0xffffffff;
> > > > > > +	snprintf(name, sizeof(name), "%s%u",
> MLX5_VXLAN_DEVICE_PFX,
> > > > > port);
> > > > > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > > > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > > > > +	assert(na_info);
> > > > > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > > > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > > > > +	if (ifouter)
> > > > > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > > > > +	assert(na_vxlan);
> > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
> 1);
> > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > > > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > > > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > > > > +	mnl_attr_nest_end(nlh, na_info);
> > > > > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > +	if (ret)
> > > > > > +		DRV_LOG(WARNING,
> > > > > > +			"netlink: VTEP %s create failure (%d)",
> > > > > > +			name, rte_errno);
> > > > > > +	else
> > > > > > +		vtep->created = 1;
> > > > >
> > > > > Flow of code here isn't smooth, thus could be error-prone. Most
> > > > > of all, I don't like ret has multiple meanings. ret should be
> > > > > return value but you are using it to store ifindex.
> > > > >
> > > > > > +	if (ret && ifouter)
> > > > > > +		ret = 0;
> > > > > > +	else
> > > > > > +		ret = if_nametoindex(name);
> > > > >
> > > > > If vtep isn't created and ifouter is set, then skip init below,
> > > > > which means, if
> > > >
> > > > ifouter is set for VXLAN encap devices. They should be attached to
> > > > ifouter and can not be shared. So, if ifouter I set - we do not
> > > > use the precreated/existing VXLAN devices. We have to create our
> > > > own not
> > > shared device.
> > >
> > > In your code (flow_tcf_encap_vtep_create()), it is shared by multiple
> flows.
> > > Do you mean it isn't shared between different outer ifaces? If so,
> > > that's for sure.
> > Sorry, I do not understand the question.
> > VXLAN encap device is attached to ifouter and shared by all flows with
> > this ifouter. No multiple VXLAN devices are attached to the same ifouter,
> only one.
> > VXLAN decap device has no attached ifouter, so it can not share it.
> 
> Yep, that's what I meant.
> 
> > > > > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > > > > But why do you want to try to call this API even if it failed to create
> vtep?
> > > > > Let's not make code flow convoluted even though it logically works.
> > > > > Let's make it straightforward.
> > > > >
> > > > > > +	if (ret) {
> > > > > > +		vtep->ifindex = ret;
> > > > > > +		vtep->ifouter = ifouter;
> > > > > > +		memset(buf, 0, sizeof(buf));
> > > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh,
> sizeof(*ifm));
> > > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > > +		ifm->ifi_type = 0;
> > > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > > +		ifm->ifi_flags = IFF_UP;
> > > > > > +		ifm->ifi_change = IFF_UP;
> > > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > +		if (ret) {
> > > > > > +			DRV_LOG(WARNING,
> > > > > > +				"netlink: VTEP %s set link up failure
> (%d)",
> > > > > > +				name, rte_errno);
> > > > > > +			rte_free(vtep);
> > > > > > +			rte_flow_error_set
> > > > > > +				(error, -errno,
> > > > > > +
> RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > NULL,
> > > > > > +				 "netlink: failed to set VTEP link up");
> > > > > > +			vtep = NULL;
> > > > > > +		} else {
> > > > > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex,
> error);
> > > > > > +			if (ret)
> > > > > > +				DRV_LOG(WARNING,
> > > > > > +				"VTEP %s init failure (%d)", name,
> rte_errno);
> > > > > > +		}
> > > > > > +	} else {
> > > > > > +		DRV_LOG(WARNING,
> > > > > > +			"VTEP %s failed to get index (%d)", name,
> errno);
> > > > > > +		rte_flow_error_set
> > > > > > +			(error, -errno,
> > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> NULL,
> > > > > > +			 !vtep->created ? "netlink: failed to create
> VTEP" :
> > > > > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > > > > +			 ret = 1;
> > > > >
> > > > > If it fails to create a vtep above, it will print out two
> > > > > warning messages and one rte_flow_error message. And it even
> > > > > selects message to print between two?
> > > > > And there's another info msg at the end even in case of failure.
> > > > > Do you really want to do this even with manipulating ret to
> > > > > change code path?  Not a good practice.
> > > > >
> > > > > Usually, code path should be straightforward for sucessful path
> > > > > and for errors/failures, return immediately or use 'goto' if
> > > > > there's need for
> > > cleanup.
> > > > >
> > > > > Please refactor entire function.
> > > >
> > > > I think I'll split it in two ones - for attached and potentially shared
> ifaces.
> > > > >
> > > > > > +	}
> > > > > > +	if (ret) {
> > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > +		vtep = NULL;
> > > > > > +	}
> > > > > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ?
> "OK" :
> > > > > "error");
> > > > > > +	return vtep;
> > > > > > +}
> > > > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > > > > +
> > > > > > +/**
> > > > > > + * Create target interface index for VXLAN tunneling
> decapsulation.
> > > > > > + * In order to share the UDP port within the other interfaces
> > > > > > +the
> > > > > > + * VXLAN device created as not attached to any interface (if
> created).
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] dev_flow
> > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > + * @param[out] error
> > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > + * @return
> > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > >
> > > > > Return negative errno in case of failure like others.
> > > >
> > > > Anyway, we have to return an index. If we do not return it as
> > > > function result we will need to provide some extra pointing
> > > > parameter, it
> > > complicates the code.
> > >
> > > You misunderstood it. See what I wrote below. The function still
> > > returns the index but in case of error, make it return negative errno
> instead of zero.
> > >
> > > > >
> > > > >  *   Interface index on success, a negative errno value otherwise and
> > > > > rte_errno is set.
> > > > >
> > > > > > + */
> > > > > > +static unsigned int
> > > > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > +			   struct mlx5_flow *dev_flow,
> > > > > > +			   struct rte_flow_error *error) {
> > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > > > +
> > > > > > +	vtep = NULL;
> > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > +		if (vlst->port == port) {
> > > > > > +			vtep = vlst;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > >
> > > > > You just need one variable.
> > > >
> > > > Yes. There is a long story, I forgot to revert code to one
> > > > variable after
> > > debugging.
> > > > >
> > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > >
> > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > 		if (vtep->port == port)
> > > > > 			break;
> > > > > 	}
> > > > >
> > > > > > +	if (!vtep) {
> > > > > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > > +		if (vtep)
> > > > > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> next);
> > > > > > +	} else {
> > > > > > +		if (vtep->ifouter) {
> > > > > > +			rte_flow_error_set(error, -errno,
> > > > > > +
> 	RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > NULL,
> > > > > > +				"Failed to create decap VTEP,
> attached "
> > > > > > +				"device with the same UDP port
> exists");
> > > > > > +				vtep = NULL;
> > > > >
> > > > > Making vtep null to skip the following code?
> > > >
> > > > Yes. To avoid multiple return operators in code.
> > >
> > > It's okay to have multiple returns. Why not?
> >
> > It is easy to miss the return in the midst of function  while
> refactoring/modifying the code.
> 
> Your code path doesn't look easy and free from error. Please refer to other
> control path functions in this PMD.
> 
> > > > > Please merge the two same
> > > > > if/else and make the code path strightforward. And which errno
> > > > > do you expect here?
> > > > > Should it be set EEXIST instead?
> > > > Not always. Netlink returns the code.
> > >
> > > No, that's not my point. Your code above sets errno instead of
> > > rte_errno or EEXIST.
> > >
> > > 	} else {
> > > 		if (vtep->ifouter) {
> > > 			rte_flow_error_set(error, -errno,
> > >
> > > Which one sets this errno? Here, it sets rte_errno because matched
> > > vtep
> > libmnl sets, while processing the Netlink reply message (callback.c of libmnl
> sources).
> 
> You still don't understand my point.
> 
> In this flow_tcf_decap_vtep_create(), if vtep is found (vtep != NULL), how
> can errno be set? Before the if/else, there's no libmnl call.
> 
> > > can't be used as it already has outer iface attached (error message
> > > isn't clear, please reword it too). I thought this should be EEXIST
> > > but you set errno to rte_errno but errno isn't valid at this point.
> > >
> > > >
> > > > >
> > > > > > +		}
> > > > > > +	}
> > > > > > +	if (vtep) {
> > > > > > +		vtep->refcnt++;
> > > > > > +		assert(vtep->ifindex);
> > > > > > +		return vtep->ifindex;
> > > > > > +	} else {
> > > > > > +		return 0;
> > > > > > +	}
> > > > >
> > > > > Why repeating same if/else?
> > > > >
> > > > >
> > > > > This is my suggestion but if you take my suggestion to have
> > > > > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> > > > Agree.
> > > >
> > > > >
> > > > > {
> > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > >
> > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > 		if (vtep->port == port)
> > > > > 			break;
> > > > > 	}
> > > > > 	if (vtep && vtep->ifouter)
> > > > > 		return rte_flow_error_set(... EEXIST ...);
> > > > > 	else if (vtep) {
> > > > > 		++vtep->refcnt;
> > > > > 	} else {
> > > > > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > 		if (!vtep)
> > > > > 			return rte_flow_error_set(...);
> > > > > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > > 	}
> > > > > 	assert(vtep->ifindex);
> > > > > 	return vtep->ifindex;
> > > > > }
> > > > >
> > > > >
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Creates target interface index for VXLAN tunneling
> encapsulation.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] ifouter
> > > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > > + * @param[in] dev_flow
> > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > + * @param[out] error
> > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > + * @return
> > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > > + */
> > > > > > +static unsigned int
> > > > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > +			    unsigned int ifouter,
> > > > > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > > > > +			    struct rte_flow_error *error) {
> > > > > > +	static uint16_t encap_port =
> MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > +
> > > > > > +	assert(ifouter);
> > > > > > +	/* Look whether the attached VTEP for encap is created. */
> > > > > > +	vtep = NULL;
> > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > +		if (vlst->ifouter == ifouter) {
> > > > > > +			vtep = vlst;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > >
> > > > > Same here.
> > > > >
> > > > > > +	if (!vtep) {
> > > > > > +		uint16_t pcnt;
> > > > > > +
> > > > > > +		/* Not found, we should create the new attached
> VTEP. */
> > > > > > +/*
> > > > > > + * TODO: not implemented yet
> > > > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > > > >
> > > > > Personal note is not appropriate even though it is removed in
> > > > > the following patch.
> > > > >
> > > > > > +		for (pcnt = 0; pcnt <=
> (MLX5_VXLAN_PORT_RANGE_MAX
> > > > > > +				     -
> MLX5_VXLAN_PORT_RANGE_MIN);
> > > > > pcnt++) {
> > > > > > +			encap_port++;
> > > > > > +			/* Wraparound the UDP port index. */
> > > > > > +			if (encap_port <
> MLX5_VXLAN_PORT_RANGE_MIN
> > > > > ||
> > > > > > +			    encap_port >
> MLX5_VXLAN_PORT_RANGE_MAX)
> > > > > > +				encap_port =
> > > > > MLX5_VXLAN_PORT_RANGE_MIN;
> > > > > > +			/* Check whether UDP port is in already in
> use. */
> > > > > > +			vtep = NULL;
> > > > > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > +				if (vlst->port == encap_port) {
> > > > > > +					vtep = vlst;
> > > > > > +					break;
> > > > > > +				}
> > > > > > +			}
> > > > >
> > > > > If you want to find out an empty port number, you can use
> > > > > rte_bitmap instead of repeating searching the entire list for
> > > > > all possible port
> > > numbers.
> > > >
> > > > We do not expect too many VXLAN devices have been created. bitmap.
> > >
> > > +1, valid point.
> > >
> > > > > > +			if (vtep) {
> > > > > > +				vtep = NULL;
> > > > > > +				continue;
> > > > > > +			}
> > > > > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > > > > +						     encap_port,
> error);
> > > > > > +			if (vtep) {
> > > > > > +				LIST_INSERT_HEAD(&vtep_list_vxlan,
> vtep,
> > > > > next);
> > > > > > +				break;
> > > > > > +			}
> > > > > > +			if (rte_errno != EEXIST)
> > > > > > +				break;
> > > > > > +		}
> > > > > > +	}
> > > > > > +	if (!vtep)
> > > > > > +		return 0;
> > > > > > +	vtep->refcnt++;
> > > > > > +	assert(vtep->ifindex);
> > > > > > +	return vtep->ifindex;
> > > > >
> > > > > Please refactor this func according to what I suggested for
> > > > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > > > >
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Creates target interface index for tunneling of any type.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] ifouter
> > > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > > + * @param[in] dev_flow
> > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > + * @param[out] error
> > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > + * @return
> > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > >
> > > > >  *   Interface index on success, a negative errno value otherwise and
> > > > >  *   rte_errno is set.
> > > > >
> > > > > > + */
> > > > > > +static unsigned int
> > > > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > +			    unsigned int ifouter,
> > > > > > +			    struct mlx5_flow *dev_flow,
> > > > > > +			    struct rte_flow_error *error) {
> > > > > > +	unsigned int ret;
> > > > > > +
> > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > > > > +						 dev_flow, error);
> > > > > > +		break;
> > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow,
> error);
> > > > > > +		break;
> > > > > > +	default:
> > > > > > +		rte_flow_error_set(error, ENOTSUP,
> > > > > > +
> 	RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > NULL,
> > > > > > +				"unsupported tunnel type");
> > > > > > +		ret = 0;
> > > > > > +		break;
> > > > > > +	}
> > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Deletes tunneling interface by UDP port.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] ifindex
> > > > > > + *   Network interface index of VXLAN device.
> > > > > > + * @param[in] dev_flow
> > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > + */
> > > > > > +static void
> > > > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > > > > +			    unsigned int ifindex,
> > > > > > +			    struct mlx5_flow *dev_flow) {
> > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > +
> > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > +	vtep = NULL;
> > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > +		if (vlst->ifindex == ifindex) {
> > > > > > +			vtep = vlst;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > >
> > > > > It is weird. You just can have vtep pointer in the
> > > > > dev_flow->tcf.tunnel instead of ifindex_tun which is same as
> > > > > vtep->ifindex like the assertion below. Then, this lookup can be
> skipped.
> > > >
> > > > OK. Good optimization.
> > > >
> > > > >
> > > > > > +	if (!vtep) {
> > > > > > +		DRV_LOG(WARNING, "No VTEP device found in the
> list");
> > > > > > +		goto exit;
> > > > > > +	}
> > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > +		break;
> > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > +/*
> > > > > > + * TODO: Remove the encap ancillary rules first.
> > > > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
> > > > > > +*/
> > > > >
> > > > > Is it a personal note? Please remove.
> > > > OK.
> > > >
> > > > >
> > > > > > +		break;
> > > > > > +	default:
> > > > > > +		assert(false);
> > > > > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > > > > +		break;
> > > > > > +	}
> > > > > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > > > > +	assert(vtep->refcnt);
> > > > > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > > > > +		LIST_REMOVE(vtep, next);
> > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > +	}
> > > > > > +exit:
> > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * Apply flow to E-Switch by sending Netlink message.
> > > > > >   *
> > > > > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > > > > >  	       struct rte_flow_error *error)  {
> > > > > >  	struct priv *priv = dev->data->dev_private;
> > > > > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > > > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > > > > >  	struct mlx5_flow *dev_flow;
> > > > > >  	struct nlmsghdr *nlh;
> > > > > > +	int ret;
> > > > > >
> > > > > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > > > > >  	/* E-Switch flow can't be expanded. */
> > > > > >  	assert(!LIST_NEXT(dev_flow, next));
> > > > > > +	if (dev_flow->tcf.applied)
> > > > > > +		return 0;
> > > > > >  	nlh = dev_flow->tcf.nlh;
> > > > > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > > > > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> > > > > NLM_F_EXCL;
> > > > > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > +		/*
> > > > > > +		 * Replace the interface index, target for
> > > > > > +		 * encapsulation, source for decapsulation.
> > > > > > +		 */
> > > > > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > > > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > +		/* Create actual VTEP device when rule is being
> applied. */
> > > > > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > > > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > > > > +					*dev_flow->tcf.tunnel-
> >ifindex_ptr,
> > > > > > +					dev_flow, error);
> > > > > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > > > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > > > > +			return -rte_errno;
> > > > > > +		dev_flow->tcf.tunnel->ifindex_org
> > > > > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > > > > +	}
> > > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > > > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > > > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > > > >
> > > > > ifindex_org looks a temporary storage in this code. And this
> > > > > kind of hassle
> > > > > (replace/restore) is there because you took the ifindex from the
> > > > > netlink message. Why don't you have just
> > > > >
> > > > > struct mlx5_flow_tcf_tunnel_hdr {
> > > > > 	uint32_t type; /**< Tunnel action type. */
> > > > > 	unsigned int ifindex; /**< Original dst/src interface */
> > > > > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > > > > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> > > > > */ };
> > > > >
> > > > > and don't change ifindex?
> > > >
> > > > I propose to use the local variable for ifindex_org and do not
> > > > keep it in structure. *ifindex_ptr will keep.
> > >
> > > Well, you still have to restore the ifindex whenever sending the nl
> > > msg. Most of all, ifindex_ptr in nl msg isn't a right place to store the
> ifindex.
> > It is stored there for rules w/o tunnels. It is its "native" place, Id
> > prefer not to create some new location to store the original index and save
> some space.
> > We have to swap indices only if rule has requested the tunneling.  We
> > can not
> 
> No no. At this point, flow is already created to be tunneled one. What do you
> mean by 'rules w/o tunnels' or 'only if rule has requested the tunneling'??

I mean the code handles all kind of rules - with tunnel and w/o tunnels.
The same code prepares the NL message for both rule types.

> It has already been created as a vxlan tunnel rule. It won't be changed. The
> nlmsg is supposed to have vtep ifindex but translation didn't know it and
> stored the outer iface temporarily to get it replaced by vtep ifindex. It never
> be a 'native'/'original' place to store it.

I mean, if rule does not request the tunneling action - it just keeps the
unchanged ifindex within Netlink message. If there is the tunneling - we replace
this index with some value depending on this ifindex.  We cannot replace
ifindex permanently at rule translation once, because VTEPs are created 
dynamically and VTEP ifindex can be different at the rule applying time.
So, we need to keep the original ifindex and create VTEP depending on it every
time rule is being applied.

> In which case the nl msg can be sent
> with the 'original' ifindex? Any specific example? No.
> 
> > set tunnel index permanently, because rule can be
> > applied/removed/reapplied and other new VXLAN device with new index
> >can be recreated.
> 
> Every time it is applied, it will get the vtep and overwrite vtep ifindex in the nl
> msg.

Yes. We should overwrite this field anyway, and every time at rule applying.
Because vtep ifindex can be different. And we need to keep the original ifindex
(for example to dynamically  create VTEP attached to it). Do you propose to keep
ifindex_org field?  Now, as I can see we have to keep ifindex_ptr field only.

> 
> > > have vtep ifindex but it just temporarily keeps the device ifindex
> > > until vtep is created/found.
> 
> Thanks,
> Yongseok
  
Yongseok Koh Oct. 29, 2018, 6:42 p.m. UTC | #7
On Mon, Oct 29, 2018 at 04:53:34AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Saturday, October 27, 2018 1:43
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > management
> > 
> > On Fri, Oct 26, 2018 at 02:35:24AM -0700, Slava Ovsiienko wrote:
> > > > -----Original Message-----
> > > > From: Yongseok Koh
> > > > Sent: Friday, October 26, 2018 9:26
> > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > > > management
> > > >
> > > > On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > > > > > -----Original Message-----
> > > > > > From: Yongseok Koh
> > > > > > Sent: Thursday, October 25, 2018 3:28
> > > > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel
> > > > > > devices management
> > > > > >
> > > > > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko
> > wrote:
> > > > > > > VXLAN interfaces are dynamically created for each local UDP
> > > > > > > port of outer networks and then used as targets for TC
> > > > > > > "flower" filters in order to perform encapsulation. These
> > > > > > > VXLAN interfaces are system-wide, the only one device with
> > > > > > > given UDP port can exist in the system (the attempt of
> > > > > > > creating another device with the same UDP local port returns
> > > > > > > EEXIST), so PMD should support the shared device instances
> > > > > > > database for PMD instances. These VXLAN implicitly created devices
> > are called VTEPs (Virtual Tunnel End Points).
> > > > > > >
> > > > > > > Creation of the VTEP occurs at the moment of rule applying.
> > > > > > > The link is set up, root ingress qdisc is also initialized.
> > > > > > >
> > > > > > > Encapsulation VTEPs are created on per port basis, the single
> > > > > > > VTEP is attached to the outer interface and is shared for all
> > > > > > > encapsulation rules on this interface. The source UDP port is
> > > > > > > automatically selected in range 30000-60000.
> > > > > > >
> > > > > > > For decapsulaton one VTEP is created per every unique UDP
> > > > > > > local port to accept tunnel traffic. The name of created VTEP
> > > > > > > consists of prefix "vmlx_" and the number of UDP port in
> > > > > > > decimal digits without leading zeros (vmlx_4789). The VTEP can
> > > > > > > be preliminary created in the system before the launching
> > > > > > > application, it allows to share	UDP ports between primary
> > > > > > > and secondary processes.
> > > > > > >
> > > > > > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > > > > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > > > > > ---
> > > > > > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > > > > > ++++++++++++++++++++++++++++++++++++++-
> > > > > > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > index d6840d5..efa9c3b 100644
> > > > > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > > > > > >  	return -err;
> > > > > > >  }
> > > > > > >
> > > > > > > +/* VTEP device list is shared between PMD port instances. */
> > > > > > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > > > > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> > static
> > > > > > pthread_mutex_t
> > > > > > > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > > > > >
> > > > > > What's the reason for choosing pthread_mutex instead of
> > rte_*_lock?
> > > > >
> > > > > The sharing this database for secondary processes?
> > > >
> > > > The static variable isn't shared with sec proc. But you can leave it as is.
> > >
> > > Yes. The sharing just was assumed, not implemented yet.
> > >
> > > >
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Deletes VTEP network device.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] vtep
> > > > > > > + *   Object represinting the network device to delete. Memory
> > > > > > > + *   allocated for this object is freed by routine.
> > > > > > > + */
> > > > > > > +static void
> > > > > > > +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf,
> > > > > > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > > > > > +	struct nlmsghdr *nlh;
> > > > > > > +	struct ifinfomsg *ifm;
> > > > > > > +	alignas(struct nlmsghdr)
> > > > > > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	assert(!vtep->refcnt);
> > > > > > > +	if (vtep->created && vtep->ifindex) {
> > > > > >
> > > > > > First of all vtep->created seems of no use. It is introduced to
> > > > > > select the error message in flow_tcf_create_iface(). I don't see
> > > > > > any necessity to distinguish between 'vtep is allocated by
> > > > > > rte_malloc()' and
> > > > 'vtep is created in kernel'.
> > > > >
> > > > > created flag indicates the iface is created by our code.
> > > > > The VXLAN decap devices must have the specified UDP port, we can
> > > > > not create multiple VXLAN devices with the same UDP port - EEXIST
> > > > > is returned. So, we have to share device. One option is create
> > > > > device before DPDK application launch and use these pre-created
> > devices.
> > > > > Inthis case created flag is not set and VXLAN device is not
> > > > > reinitialized, and
> > > > not deleted.
> > > >
> > > > I can't see any code to use pre-created device (created even before
> > > > dpdk app launch). Your code just tries to create 'vmlx_xxxx'. Even
> > > > from your comment in [7/7] patch, PMD will cleanup any leftovers
> > > > (existing vtep devices) on initialization. Your comment sounds conflicting
> > and confusing.
> > >
> > > There are two types of VXLAN devices:
> > >
> > > - VXLAN decap, not attached to any ifouter. Provides the ingress UDP
> > > port,  we try to share the devices of this type, because we may be
> > > asked for  the specified UDP port. No device/rule cleanup and reinit
> > needed.
> > >
> > > - VXLAN encap, should be attached to ifouter to provide strict egress
> > > path, no need to share - egress UDP port does not matter. And we need
> > > to cleanup ifouter, remove other attached VXLAN devices and rules,
> > > because it is too hard to co-exist with some pre-created setup..
> > 
> > I knew that. But how can it justify the need of 'created' field in vtep struct?
> > In this code, it is of no use. But will see how it is used in your v3.
> > 
> > > > > > And why do you need to check vtep->ifindex as well? If vtep is
> > > > > > created in kernel and its ifindex isn't set, that should be an
> > > > > > error which had to be hanled in flow_tcf_create_iface(). Such a
> > > > > > vtep shouldn't
> > > > exist.
> > > > > Yes, if we did not get ifindex of device - vtep is not created, error
> > returned.
> > > > > We just can not operate w/o ifindex.
> > > >
> > > > I know ifindex is needed but my question was checking vtep->ifindex
> > > > here looked redundant/unnecessary. But as you agreed on having
> > > > create/get/release_iface(), it doesn't matter much.
> > >
> > > Yes. I agree, will refactor the code.
> > >
> > > >
> > > > > > Also, the refcnt management is a bit strange. Please put an
> > > > > > abstraction by adding create_iface(), get_iface() and
> > > > > > release_iface(). In the get_ifce(),
> > > > > > vtep->refcnt should be incremented. And in the release_iface(),
> > > > > > vtep->it decrease the
> > > > > OK. Good proposal. I'll refactor the code.
> > > > >
> > > > > > refcnt and if it reaches to zero, the iface can be removed.
> > > > > > create_iface() will set the refcnt to 1. And if you refer to
> > > > > > mlx5_hrxq_get(), it even does searching the list not by
> > > > > > repeating the
> > > > same lookup code here and there.
> > > > > > That will make your code much simpler.
> > > > > >
> > > > > > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh,
> > sizeof(*ifm));
> > > > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +		if (ret)
> > > > > > > +			DRV_LOG(WARNING, "netlink: error deleting
> > VXLAN
> > > > > > "
> > > > > > > +					 "encap/decap ifindex %u",
> > > > > > > +					 ifm->ifi_index);
> > > > > > > +	}
> > > > > > > +	rte_free(vtep);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Creates VTEP network device.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] ifouter
> > > > > > > + *   Outer interface to attach new-created VXLAN device
> > > > > > > + *   If zero the VXLAN device will not be attached to any device.
> > > > > > > + * @param[in] port
> > > > > > > + *   UDP port of created VTEP device.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + *
> > > > > > > + * @return
> > > > > > > + * Pointer to created device structure on success, NULL
> > > > > > > +otherwise
> > > > > > > + * and rte_errno is set.
> > > > > > > + */
> > > > > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > > > > >
> > > > > > Why negative(ifndef) first intead of positive(ifdef)?
> > > > > Hm. Did I miss the rule. Positive #ifdef first? OK.
> > > >
> > > > No concrete rule but if there's no specific reason, it would be
> > > > better to start from ifdef.
> > > >
> > > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf
> > __rte_unused,
> > > > > > > +		      unsigned int ifouter __rte_unused,
> > > > > > > +		      uint16_t port __rte_unused,
> > > > > > > +		      struct rte_flow_error *error) {
> > > > > > > +	rte_flow_error_set(error, ENOTSUP,
> > > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > > > > > +			 "netlink: failed to create VTEP, "
> > > > > > > +			 "VXLAN metadat is not supported by
> > kernel");
> > > > > >
> > > > > > Typo.
> > > > >
> > > > > OK.  "metadata are not supported".
> > > > > >
> > > > > > > +	return NULL;
> > > > > > > +}
> > > > > > > +#else
> > > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> > > > > >
> > > > > > How about adding 'vtep'? It sounds vague - creating a general
> > interface.
> > > > > > E.g., flow_tcf_create_vtep_iface()?
> > > > >
> > > > > OK.
> > > > >
> > > > > >
> > > > > > > +		      unsigned int ifouter,
> > > > > > > +		      uint16_t port, struct rte_flow_error *error) {
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > > > > > +	struct nlmsghdr *nlh;
> > > > > > > +	struct ifinfomsg *ifm;
> > > > > > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > > > > > +	alignas(struct nlmsghdr)
> > > > > > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > > > > >
> > > > > > Use a macro for '128'. Can't know the meaning.
> > > > > OK. I think we should calculate the buffer size explicitly.
> > > > >
> > > > > >
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > > > > > +		       SZ_NLATTR_NEST * 2 +
> > > > > > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > > > > > +	struct nlattr *na_info;
> > > > > > > +	struct nlattr *na_vxlan;
> > > > > > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > > > > >
> > > > > > Use rte_cpu_to_be_*() instead.
> > > > >
> > > > > Yes, I'll recheck the whole code for this issue.
> > > > >
> > > > > >
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > > > > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > > > > > +	if (!vtep) {
> > > > > > > +		rte_flow_error_set
> > > > > > > +			(error, ENOMEM,
> > > > > > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > > +			 NULL, "unadble to allocate memory for
> > VTEP desc");
> > > > > > > +		return NULL;
> > > > > > > +	}
> > > > > > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > > > > > +			.refcnt = 0,
> > > > > > > +			.port = port,
> > > > > > > +			.created = 0,
> > > > > > > +			.ifouter = 0,
> > > > > > > +			.ifindex = 0,
> > > > > > > +			.local = LIST_HEAD_INITIALIZER(),
> > > > > > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > > > > > +	};
> > > > > > > +	memset(buf, 0, sizeof(buf));
> > > > > > > +	nlh = mnl_nlmsg_put_header(buf);
> > > > > > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > > > > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> > > > > > NLM_F_EXCL;
> > > > > > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > > > +	ifm->ifi_family = AF_UNSPEC;
> > > > > > > +	ifm->ifi_type = 0;
> > > > > > > +	ifm->ifi_index = 0;
> > > > > > > +	ifm->ifi_flags = IFF_UP;
> > > > > > > +	ifm->ifi_change = 0xffffffff;
> > > > > > > +	snprintf(name, sizeof(name), "%s%u",
> > MLX5_VXLAN_DEVICE_PFX,
> > > > > > port);
> > > > > > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > > > > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > > > > > +	assert(na_info);
> > > > > > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > > > > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > > > > > +	if (ifouter)
> > > > > > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > > > > > +	assert(na_vxlan);
> > > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
> > 1);
> > > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > > > > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > > > > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > > > > > +	mnl_attr_nest_end(nlh, na_info);
> > > > > > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +	if (ret)
> > > > > > > +		DRV_LOG(WARNING,
> > > > > > > +			"netlink: VTEP %s create failure (%d)",
> > > > > > > +			name, rte_errno);
> > > > > > > +	else
> > > > > > > +		vtep->created = 1;
> > > > > >
> > > > > > Flow of code here isn't smooth, thus could be error-prone. Most
> > > > > > of all, I don't like ret has multiple meanings. ret should be
> > > > > > return value but you are using it to store ifindex.
> > > > > >
> > > > > > > +	if (ret && ifouter)
> > > > > > > +		ret = 0;
> > > > > > > +	else
> > > > > > > +		ret = if_nametoindex(name);
> > > > > >
> > > > > > If vtep isn't created and ifouter is set, then skip init below,
> > > > > > which means, if
> > > > >
> > > > > ifouter is set for VXLAN encap devices. They should be attached to
> > > > > ifouter and can not be shared. So, if ifouter I set - we do not
> > > > > use the precreated/existing VXLAN devices. We have to create our
> > > > > own not
> > > > shared device.
> > > >
> > > > In your code (flow_tcf_encap_vtep_create()), it is shared by multiple
> > flows.
> > > > Do you mean it isn't shared between different outer ifaces? If so,
> > > > that's for sure.
> > > Sorry, I do not understand the question.
> > > VXLAN encap device is attached to ifouter and shared by all flows with
> > > this ifouter. No multiple VXLAN devices are attached to the same ifouter,
> > only one.
> > > VXLAN decap device has no attached ifouter, so it can not share it.
> > 
> > Yep, that's what I meant.
> > 
> > > > > > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > > > > > But why do you want to try to call this API even if it failed to create
> > vtep?
> > > > > > Let's not make code flow convoluted even though it logically works.
> > > > > > Let's make it straightforward.
> > > > > >
> > > > > > > +	if (ret) {
> > > > > > > +		vtep->ifindex = ret;
> > > > > > > +		vtep->ifouter = ifouter;
> > > > > > > +		memset(buf, 0, sizeof(buf));
> > > > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh,
> > sizeof(*ifm));
> > > > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > > > +		ifm->ifi_type = 0;
> > > > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > > > +		ifm->ifi_flags = IFF_UP;
> > > > > > > +		ifm->ifi_change = IFF_UP;
> > > > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +		if (ret) {
> > > > > > > +			DRV_LOG(WARNING,
> > > > > > > +				"netlink: VTEP %s set link up failure
> > (%d)",
> > > > > > > +				name, rte_errno);
> > > > > > > +			rte_free(vtep);
> > > > > > > +			rte_flow_error_set
> > > > > > > +				(error, -errno,
> > > > > > > +
> > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > NULL,
> > > > > > > +				 "netlink: failed to set VTEP link up");
> > > > > > > +			vtep = NULL;
> > > > > > > +		} else {
> > > > > > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex,
> > error);
> > > > > > > +			if (ret)
> > > > > > > +				DRV_LOG(WARNING,
> > > > > > > +				"VTEP %s init failure (%d)", name,
> > rte_errno);
> > > > > > > +		}
> > > > > > > +	} else {
> > > > > > > +		DRV_LOG(WARNING,
> > > > > > > +			"VTEP %s failed to get index (%d)", name,
> > errno);
> > > > > > > +		rte_flow_error_set
> > > > > > > +			(error, -errno,
> > > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > > > > > +			 !vtep->created ? "netlink: failed to create
> > VTEP" :
> > > > > > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > > > > > +			 ret = 1;
> > > > > >
> > > > > > If it fails to create a vtep above, it will print out two
> > > > > > warning messages and one rte_flow_error message. And it even
> > > > > > selects message to print between two?
> > > > > > And there's another info msg at the end even in case of failure.
> > > > > > Do you really want to do this even with manipulating ret to
> > > > > > change code path?  Not a good practice.
> > > > > >
> > > > > > Usually, code path should be straightforward for sucessful path
> > > > > > and for errors/failures, return immediately or use 'goto' if
> > > > > > there's need for
> > > > cleanup.
> > > > > >
> > > > > > Please refactor entire function.
> > > > >
> > > > > I think I'll split it in two ones - for attached and potentially shared
> > ifaces.
> > > > > >
> > > > > > > +	}
> > > > > > > +	if (ret) {
> > > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > > +		vtep = NULL;
> > > > > > > +	}
> > > > > > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ?
> > "OK" :
> > > > > > "error");
> > > > > > > +	return vtep;
> > > > > > > +}
> > > > > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Create target interface index for VXLAN tunneling
> > decapsulation.
> > > > > > > + * In order to share the UDP port within the other interfaces
> > > > > > > +the
> > > > > > > + * VXLAN device created as not attached to any interface (if
> > created).
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + * @return
> > > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > >
> > > > > > Return negative errno in case of failure like others.
> > > > >
> > > > > Anyway, we have to return an index. If we do not return it as
> > > > > function result we will need to provide some extra pointing
> > > > > parameter, it
> > > > complicates the code.
> > > >
> > > > You misunderstood it. See what I wrote below. The function still
> > > > returns the index but in case of error, make it return negative errno
> > instead of zero.
> > > >
> > > > > >
> > > > > >  *   Interface index on success, a negative errno value otherwise and
> > > > > > rte_errno is set.
> > > > > >
> > > > > > > + */
> > > > > > > +static unsigned int
> > > > > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > > +			   struct mlx5_flow *dev_flow,
> > > > > > > +			   struct rte_flow_error *error) {
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > > > > +
> > > > > > > +	vtep = NULL;
> > > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +		if (vlst->port == port) {
> > > > > > > +			vtep = vlst;
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > >
> > > > > > You just need one variable.
> > > > >
> > > > > Yes. There is a long story, I forgot to revert code to one
> > > > > variable after
> > > > debugging.
> > > > > >
> > > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > > >
> > > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > > 		if (vtep->port == port)
> > > > > > 			break;
> > > > > > 	}
> > > > > >
> > > > > > > +	if (!vtep) {
> > > > > > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > > > +		if (vtep)
> > > > > > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> > next);
> > > > > > > +	} else {
> > > > > > > +		if (vtep->ifouter) {
> > > > > > > +			rte_flow_error_set(error, -errno,
> > > > > > > +
> > 	RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > NULL,
> > > > > > > +				"Failed to create decap VTEP,
> > attached "
> > > > > > > +				"device with the same UDP port
> > exists");
> > > > > > > +				vtep = NULL;
> > > > > >
> > > > > > Making vtep null to skip the following code?
> > > > >
> > > > > Yes. To avoid multiple return operators in code.
> > > >
> > > > It's okay to have multiple returns. Why not?
> > >
> > > It is easy to miss the return in the midst of function  while
> > refactoring/modifying the code.
> > 
> > Your code path doesn't look easy and free from error. Please refer to other
> > control path functions in this PMD.
> > 
> > > > > > Please merge the two same
> > > > > > if/else and make the code path strightforward. And which errno
> > > > > > do you expect here?
> > > > > > Should it be set EEXIST instead?
> > > > > Not always. Netlink returns the code.
> > > >
> > > > No, that's not my point. Your code above sets errno instead of
> > > > rte_errno or EEXIST.
> > > >
> > > > 	} else {
> > > > 		if (vtep->ifouter) {
> > > > 			rte_flow_error_set(error, -errno,
> > > >
> > > > Which one sets this errno? Here, it sets rte_errno because matched
> > > > vtep
> > > libmnl sets, while processing the Netlink reply message (callback.c of libmnl
> > sources).
> > 
> > You still don't understand my point.
> > 
> > In this flow_tcf_decap_vtep_create(), if vtep is found (vtep != NULL), how
> > can errno be set? Before the if/else, there's no libmnl call.
> > 
> > > > can't be used as it already has outer iface attached (error message
> > > > isn't clear, please reword it too). I thought this should be EEXIST
> > > > but you set errno to rte_errno but errno isn't valid at this point.
> > > >
> > > > >
> > > > > >
> > > > > > > +		}
> > > > > > > +	}
> > > > > > > +	if (vtep) {
> > > > > > > +		vtep->refcnt++;
> > > > > > > +		assert(vtep->ifindex);
> > > > > > > +		return vtep->ifindex;
> > > > > > > +	} else {
> > > > > > > +		return 0;
> > > > > > > +	}
> > > > > >
> > > > > > Why repeating same if/else?
> > > > > >
> > > > > >
> > > > > > This is my suggestion but if you take my suggestion to have
> > > > > > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> > > > > Agree.
> > > > >
> > > > > >
> > > > > > {
> > > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > > > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > > >
> > > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > > 		if (vtep->port == port)
> > > > > > 			break;
> > > > > > 	}
> > > > > > 	if (vtep && vtep->ifouter)
> > > > > > 		return rte_flow_error_set(... EEXIST ...);
> > > > > > 	else if (vtep) {
> > > > > > 		++vtep->refcnt;
> > > > > > 	} else {
> > > > > > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > > 		if (!vtep)
> > > > > > 			return rte_flow_error_set(...);
> > > > > > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > > > 	}
> > > > > > 	assert(vtep->ifindex);
> > > > > > 	return vtep->ifindex;
> > > > > > }
> > > > > >
> > > > > >
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Creates target interface index for VXLAN tunneling
> > encapsulation.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] ifouter
> > > > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + * @return
> > > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > > > + */
> > > > > > > +static unsigned int
> > > > > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > > +			    unsigned int ifouter,
> > > > > > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > > > > > +			    struct rte_flow_error *error) {
> > > > > > > +	static uint16_t encap_port =
> > MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > > +
> > > > > > > +	assert(ifouter);
> > > > > > > +	/* Look whether the attached VTEP for encap is created. */
> > > > > > > +	vtep = NULL;
> > > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +		if (vlst->ifouter == ifouter) {
> > > > > > > +			vtep = vlst;
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > >
> > > > > > Same here.
> > > > > >
> > > > > > > +	if (!vtep) {
> > > > > > > +		uint16_t pcnt;
> > > > > > > +
> > > > > > > +		/* Not found, we should create the new attached
> > VTEP. */
> > > > > > > +/*
> > > > > > > + * TODO: not implemented yet
> > > > > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > > > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > > > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > > > > >
> > > > > > Personal note is not appropriate even though it is removed in
> > > > > > the following patch.
> > > > > >
> > > > > > > +		for (pcnt = 0; pcnt <=
> > (MLX5_VXLAN_PORT_RANGE_MAX
> > > > > > > +				     -
> > MLX5_VXLAN_PORT_RANGE_MIN);
> > > > > > pcnt++) {
> > > > > > > +			encap_port++;
> > > > > > > +			/* Wraparound the UDP port index. */
> > > > > > > +			if (encap_port <
> > MLX5_VXLAN_PORT_RANGE_MIN
> > > > > > ||
> > > > > > > +			    encap_port >
> > MLX5_VXLAN_PORT_RANGE_MAX)
> > > > > > > +				encap_port =
> > > > > > MLX5_VXLAN_PORT_RANGE_MIN;
> > > > > > > +			/* Check whether UDP port is in already in
> > use. */
> > > > > > > +			vtep = NULL;
> > > > > > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +				if (vlst->port == encap_port) {
> > > > > > > +					vtep = vlst;
> > > > > > > +					break;
> > > > > > > +				}
> > > > > > > +			}
> > > > > >
> > > > > > If you want to find out an empty port number, you can use
> > > > > > rte_bitmap instead of repeating searching the entire list for
> > > > > > all possible port
> > > > numbers.
> > > > >
> > > > > We do not expect too many VXLAN devices have been created. bitmap.
> > > >
> > > > +1, valid point.
> > > >
> > > > > > > +			if (vtep) {
> > > > > > > +				vtep = NULL;
> > > > > > > +				continue;
> > > > > > > +			}
> > > > > > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > > > > > +						     encap_port,
> > error);
> > > > > > > +			if (vtep) {
> > > > > > > +				LIST_INSERT_HEAD(&vtep_list_vxlan,
> > vtep,
> > > > > > next);
> > > > > > > +				break;
> > > > > > > +			}
> > > > > > > +			if (rte_errno != EEXIST)
> > > > > > > +				break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > > > +	if (!vtep)
> > > > > > > +		return 0;
> > > > > > > +	vtep->refcnt++;
> > > > > > > +	assert(vtep->ifindex);
> > > > > > > +	return vtep->ifindex;
> > > > > >
> > > > > > Please refactor this func according to what I suggested for
> > > > > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > > > > >
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Creates target interface index for tunneling of any type.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] ifouter
> > > > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + * @return
> > > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > >
> > > > > >  *   Interface index on success, a negative errno value otherwise and
> > > > > >  *   rte_errno is set.
> > > > > >
> > > > > > > + */
> > > > > > > +static unsigned int
> > > > > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > > +			    unsigned int ifouter,
> > > > > > > +			    struct mlx5_flow *dev_flow,
> > > > > > > +			    struct rte_flow_error *error) {
> > > > > > > +	unsigned int ret;
> > > > > > > +
> > > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > > > > > +						 dev_flow, error);
> > > > > > > +		break;
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow,
> > error);
> > > > > > > +		break;
> > > > > > > +	default:
> > > > > > > +		rte_flow_error_set(error, ENOTSUP,
> > > > > > > +
> > 	RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > NULL,
> > > > > > > +				"unsupported tunnel type");
> > > > > > > +		ret = 0;
> > > > > > > +		break;
> > > > > > > +	}
> > > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Deletes tunneling interface by UDP port.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] ifindex
> > > > > > > + *   Network interface index of VXLAN device.
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + */
> > > > > > > +static void
> > > > > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > > > > > +			    unsigned int ifindex,
> > > > > > > +			    struct mlx5_flow *dev_flow) {
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > > +
> > > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > > +	vtep = NULL;
> > > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +		if (vlst->ifindex == ifindex) {
> > > > > > > +			vtep = vlst;
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > >
> > > > > > It is weird. You just can have vtep pointer in the
> > > > > > dev_flow->tcf.tunnel instead of ifindex_tun which is same as
> > > > > > vtep->ifindex like the assertion below. Then, this lookup can be
> > skipped.
> > > > >
> > > > > OK. Good optimization.
> > > > >
> > > > > >
> > > > > > > +	if (!vtep) {
> > > > > > > +		DRV_LOG(WARNING, "No VTEP device found in the
> > list");
> > > > > > > +		goto exit;
> > > > > > > +	}
> > > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > > +		break;
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > > +/*
> > > > > > > + * TODO: Remove the encap ancillary rules first.
> > > > > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > > > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
> > > > > > > +*/
> > > > > >
> > > > > > Is it a personal note? Please remove.
> > > > > OK.
> > > > >
> > > > > >
> > > > > > > +		break;
> > > > > > > +	default:
> > > > > > > +		assert(false);
> > > > > > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > > > > > +		break;
> > > > > > > +	}
> > > > > > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > > > > > +	assert(vtep->refcnt);
> > > > > > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > > > > > +		LIST_REMOVE(vtep, next);
> > > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > > +	}
> > > > > > > +exit:
> > > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > > +}
> > > > > > > +
> > > > > > >  /**
> > > > > > >   * Apply flow to E-Switch by sending Netlink message.
> > > > > > >   *
> > > > > > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > > > > > >  	       struct rte_flow_error *error)  {
> > > > > > >  	struct priv *priv = dev->data->dev_private;
> > > > > > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > > > > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > > > > > >  	struct mlx5_flow *dev_flow;
> > > > > > >  	struct nlmsghdr *nlh;
> > > > > > > +	int ret;
> > > > > > >
> > > > > > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > > > > > >  	/* E-Switch flow can't be expanded. */
> > > > > > >  	assert(!LIST_NEXT(dev_flow, next));
> > > > > > > +	if (dev_flow->tcf.applied)
> > > > > > > +		return 0;
> > > > > > >  	nlh = dev_flow->tcf.nlh;
> > > > > > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > > > > > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> > > > > > NLM_F_EXCL;
> > > > > > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > > +		/*
> > > > > > > +		 * Replace the interface index, target for
> > > > > > > +		 * encapsulation, source for decapsulation.
> > > > > > > +		 */
> > > > > > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > > > > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > > +		/* Create actual VTEP device when rule is being
> > applied. */
> > > > > > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > > > > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > > > > > +					*dev_flow->tcf.tunnel-
> > >ifindex_ptr,
> > > > > > > +					dev_flow, error);
> > > > > > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > > > > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > > > > > +			return -rte_errno;
> > > > > > > +		dev_flow->tcf.tunnel->ifindex_org
> > > > > > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > > > > > +	}
> > > > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > > > > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > > > > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > > > > >
> > > > > > ifindex_org looks a temporary storage in this code. And this
> > > > > > kind of hassle
> > > > > > (replace/restore) is there because you took the ifindex from the
> > > > > > netlink message. Why don't you have just
> > > > > >
> > > > > > struct mlx5_flow_tcf_tunnel_hdr {
> > > > > > 	uint32_t type; /**< Tunnel action type. */
> > > > > > 	unsigned int ifindex; /**< Original dst/src interface */
> > > > > > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > > > > > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> > > > > > */ };
> > > > > >
> > > > > > and don't change ifindex?
> > > > >
> > > > > I propose to use the local variable for ifindex_org and do not
> > > > > keep it in structure. *ifindex_ptr will keep.
> > > >
> > > > Well, you still have to restore the ifindex whenever sending the nl
> > > > msg. Most of all, ifindex_ptr in nl msg isn't a right place to store the
> > ifindex.
> > > It is stored there for rules w/o tunnels. It is its "native" place, Id
> > > prefer not to create some new location to store the original index and save
> > some space.
> > > We have to swap indices only if rule has requested the tunneling.  We
> > > can not
> > 
> > No no. At this point, flow is already created to be tunneled one. What do you
> > mean by 'rules w/o tunnels' or 'only if rule has requested the tunneling'??
> 
> I mean the code handles all kind of rules - with tunnel and w/o tunnels.
> The same code prepares the NL message for both rule types.
> 
> > It has already been created as a vxlan tunnel rule. It won't be changed. The
> > nlmsg is supposed to have vtep ifindex but translation didn't know it and
> > stored the outer iface temporarily to get it replaced by vtep ifindex. It never
> > be a 'native'/'original' place to store it.
> 
> I mean, if rule does not request the tunneling action - it just keeps the
> unchanged ifindex within Netlink message. If there is the tunneling - we replace
> this index with some value depending on this ifindex.  We cannot replace
> ifindex permanently at rule translation once, because VTEPs are created 
> dynamically and VTEP ifindex can be different at the rule applying time.
> So, we need to keep the original ifindex and create VTEP depending on it every
> time rule is being applied.

Rules w/o tunnels doesn't use the struct (mlx5_flow_tcf_tunnel_hdr) anyway. I
don't understand why you care.

> > In which case the nl msg can be sent
> > with the 'original' ifindex? Any specific example? No.
> > 
> > > set tunnel index permanently, because rule can be
> > > applied/removed/reapplied and other new VXLAN device with new index
> > >can be recreated.
> > 
> > Every time it is applied, it will get the vtep and overwrite vtep ifindex in the nl
> > msg.
> 
> Yes. We should overwrite this field anyway, and every time at rule applying.
> Because vtep ifindex can be different. And we need to keep the original ifindex
> (for example to dynamically  create VTEP attached to it). Do you propose to keep
> ifindex_org field?  Now, as I can see we have to keep ifindex_ptr field only.

Like I suggested above, yes. What I don't like is replace/restore the indexes
every time. I don't understand why you want to shuffling around a variable each
time. nlmsg is temporary anyway, there's no 'native' place. But I'll leave it up
to you. This isn't a super critical issue.


Thanks,
Yongseok
  

Patch

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index d6840d5..efa9c3b 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -3443,6 +3443,432 @@  struct pedit_parser {
 	return -err;
 }
 
+/* VTEP device list is shared between PMD port instances. */
+static LIST_HEAD(, mlx5_flow_tcf_vtep)
+			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
+static pthread_mutex_t vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+/**
+ * Deletes VTEP network device.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] vtep
+ *   Object represinting the network device to delete. Memory
+ *   allocated for this object is freed by routine.
+ */
+static void
+flow_tcf_delete_iface(struct mlx5_flow_tcf_context *tcf,
+		      struct mlx5_flow_tcf_vtep *vtep)
+{
+	struct nlmsghdr *nlh;
+	struct ifinfomsg *ifm;
+	alignas(struct nlmsghdr)
+	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
+	int ret;
+
+	assert(!vtep->refcnt);
+	if (vtep->created && vtep->ifindex) {
+		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
+		nlh = mnl_nlmsg_put_header(buf);
+		nlh->nlmsg_type = RTM_DELLINK;
+		nlh->nlmsg_flags = NLM_F_REQUEST;
+		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+		ifm->ifi_family = AF_UNSPEC;
+		ifm->ifi_index = vtep->ifindex;
+		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+		if (ret)
+			DRV_LOG(WARNING, "netlink: error deleting VXLAN "
+					 "encap/decap ifindex %u",
+					 ifm->ifi_index);
+	}
+	rte_free(vtep);
+}
+
+/**
+ * Creates VTEP network device.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifouter
+ *   Outer interface to attach new-created VXLAN device
+ *   If zero the VXLAN device will not be attached to any device.
+ * @param[in] port
+ *   UDP port of created VTEP device.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ * Pointer to created device structure on success, NULL otherwise
+ * and rte_errno is set.
+ */
+#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
+static struct mlx5_flow_tcf_vtep*
+flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
+		      unsigned int ifouter __rte_unused,
+		      uint16_t port __rte_unused,
+		      struct rte_flow_error *error)
+{
+	rte_flow_error_set(error, ENOTSUP,
+			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+			 "netlink: failed to create VTEP, "
+			 "VXLAN metadat is not supported by kernel");
+	return NULL;
+}
+#else
+static struct mlx5_flow_tcf_vtep*
+flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
+		      unsigned int ifouter,
+		      uint16_t port, struct rte_flow_error *error)
+{
+	struct mlx5_flow_tcf_vtep *vtep;
+	struct nlmsghdr *nlh;
+	struct ifinfomsg *ifm;
+	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
+	alignas(struct nlmsghdr)
+	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
+		       SZ_NLATTR_DATA_OF(sizeof(name)) +
+		       SZ_NLATTR_NEST * 2 +
+		       SZ_NLATTR_STRZ_OF("vxlan") +
+		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
+		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
+		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
+		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
+	struct nlattr *na_info;
+	struct nlattr *na_vxlan;
+	rte_be16_t vxlan_port = RTE_BE16(port);
+	int ret;
+
+	vtep = rte_zmalloc(__func__, sizeof(*vtep),
+			alignof(struct mlx5_flow_tcf_vtep));
+	if (!vtep) {
+		rte_flow_error_set
+			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			 NULL, "unadble to allocate memory for VTEP desc");
+		return NULL;
+	}
+	*vtep = (struct mlx5_flow_tcf_vtep){
+			.refcnt = 0,
+			.port = port,
+			.created = 0,
+			.ifouter = 0,
+			.ifindex = 0,
+			.local = LIST_HEAD_INITIALIZER(),
+			.neigh = LIST_HEAD_INITIALIZER(),
+	};
+	memset(buf, 0, sizeof(buf));
+	nlh = mnl_nlmsg_put_header(buf);
+	nlh->nlmsg_type = RTM_NEWLINK;
+	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  | NLM_F_EXCL;
+	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+	ifm->ifi_family = AF_UNSPEC;
+	ifm->ifi_type = 0;
+	ifm->ifi_index = 0;
+	ifm->ifi_flags = IFF_UP;
+	ifm->ifi_change = 0xffffffff;
+	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX, port);
+	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
+	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
+	assert(na_info);
+	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
+	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
+	if (ifouter)
+		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
+	assert(na_vxlan);
+	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
+	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
+	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
+	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
+	mnl_attr_nest_end(nlh, na_vxlan);
+	mnl_attr_nest_end(nlh, na_info);
+	assert(sizeof(buf) >= nlh->nlmsg_len);
+	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+	if (ret)
+		DRV_LOG(WARNING,
+			"netlink: VTEP %s create failure (%d)",
+			name, rte_errno);
+	else
+		vtep->created = 1;
+	if (ret && ifouter)
+		ret = 0;
+	else
+		ret = if_nametoindex(name);
+	if (ret) {
+		vtep->ifindex = ret;
+		vtep->ifouter = ifouter;
+		memset(buf, 0, sizeof(buf));
+		nlh = mnl_nlmsg_put_header(buf);
+		nlh->nlmsg_type = RTM_NEWLINK;
+		nlh->nlmsg_flags = NLM_F_REQUEST;
+		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+		ifm->ifi_family = AF_UNSPEC;
+		ifm->ifi_type = 0;
+		ifm->ifi_index = vtep->ifindex;
+		ifm->ifi_flags = IFF_UP;
+		ifm->ifi_change = IFF_UP;
+		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+		if (ret) {
+			DRV_LOG(WARNING,
+				"netlink: VTEP %s set link up failure (%d)",
+				name, rte_errno);
+			rte_free(vtep);
+			rte_flow_error_set
+				(error, -errno,
+				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				 "netlink: failed to set VTEP link up");
+			vtep = NULL;
+		} else {
+			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
+			if (ret)
+				DRV_LOG(WARNING,
+				"VTEP %s init failure (%d)", name, rte_errno);
+		}
+	} else {
+		DRV_LOG(WARNING,
+			"VTEP %s failed to get index (%d)", name, errno);
+		rte_flow_error_set
+			(error, -errno,
+			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+			 !vtep->created ? "netlink: failed to create VTEP" :
+			 "netlink: failed to retrieve VTEP ifindex");
+			 ret = 1;
+	}
+	if (ret) {
+		flow_tcf_delete_iface(tcf, vtep);
+		vtep = NULL;
+	}
+	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" : "error");
+	return vtep;
+}
+#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
+
+/**
+ * Create target interface index for VXLAN tunneling decapsulation.
+ * In order to share the UDP port within the other interfaces the
+ * VXLAN device created as not attached to any interface (if created).
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ * @return
+ *   Interface index on success, zero otherwise and rte_errno is set.
+ */
+static unsigned int
+flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
+			   struct mlx5_flow *dev_flow,
+			   struct rte_flow_error *error)
+{
+	struct mlx5_flow_tcf_vtep *vtep, *vlst;
+	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
+
+	vtep = NULL;
+	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
+		if (vlst->port == port) {
+			vtep = vlst;
+			break;
+		}
+	}
+	if (!vtep) {
+		vtep = flow_tcf_create_iface(tcf, 0, port, error);
+		if (vtep)
+			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
+	} else {
+		if (vtep->ifouter) {
+			rte_flow_error_set(error, -errno,
+				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				"Failed to create decap VTEP, attached "
+				"device with the same UDP port exists");
+				vtep = NULL;
+		}
+	}
+	if (vtep) {
+		vtep->refcnt++;
+		assert(vtep->ifindex);
+		return vtep->ifindex;
+	} else {
+		return 0;
+	}
+}
+
+/**
+ * Creates target interface index for VXLAN tunneling encapsulation.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifouter
+ *   Network interface index to attach VXLAN encap device to.
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ * @return
+ *   Interface index on success, zero otherwise and rte_errno is set.
+ */
+static unsigned int
+flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
+			    unsigned int ifouter,
+			    struct mlx5_flow *dev_flow __rte_unused,
+			    struct rte_flow_error *error)
+{
+	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
+	struct mlx5_flow_tcf_vtep *vtep, *vlst;
+
+	assert(ifouter);
+	/* Look whether the attached VTEP for encap is created. */
+	vtep = NULL;
+	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
+		if (vlst->ifouter == ifouter) {
+			vtep = vlst;
+			break;
+		}
+	}
+	if (!vtep) {
+		uint16_t pcnt;
+
+		/* Not found, we should create the new attached VTEP. */
+/*
+ * TODO: not implemented yet
+ * flow_tcf_encap_iface_cleanup(tcf, ifouter);
+ * flow_tcf_encap_local_cleanup(tcf, ifouter);
+ * flow_tcf_encap_neigh_cleanup(tcf, ifouter);
+ */
+		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
+				     - MLX5_VXLAN_PORT_RANGE_MIN); pcnt++) {
+			encap_port++;
+			/* Wraparound the UDP port index. */
+			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN ||
+			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
+				encap_port = MLX5_VXLAN_PORT_RANGE_MIN;
+			/* Check whether UDP port is in already in use. */
+			vtep = NULL;
+			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
+				if (vlst->port == encap_port) {
+					vtep = vlst;
+					break;
+				}
+			}
+			if (vtep) {
+				vtep = NULL;
+				continue;
+			}
+			vtep = flow_tcf_create_iface(tcf, ifouter,
+						     encap_port, error);
+			if (vtep) {
+				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
+				break;
+			}
+			if (rte_errno != EEXIST)
+				break;
+		}
+	}
+	if (!vtep)
+		return 0;
+	vtep->refcnt++;
+	assert(vtep->ifindex);
+	return vtep->ifindex;
+}
+
+/**
+ * Creates target interface index for tunneling of any type.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifouter
+ *   Network interface index to attach VXLAN encap device to.
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ * @return
+ *   Interface index on success, zero otherwise and rte_errno is set.
+ */
+static unsigned int
+flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
+			    unsigned int ifouter,
+			    struct mlx5_flow *dev_flow,
+			    struct rte_flow_error *error)
+{
+	unsigned int ret;
+
+	assert(dev_flow->tcf.tunnel);
+	pthread_mutex_lock(&vtep_list_mutex);
+	switch (dev_flow->tcf.tunnel->type) {
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
+		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
+						 dev_flow, error);
+		break;
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
+		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
+		break;
+	default:
+		rte_flow_error_set(error, ENOTSUP,
+				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				"unsupported tunnel type");
+		ret = 0;
+		break;
+	}
+	pthread_mutex_unlock(&vtep_list_mutex);
+	return ret;
+}
+
+/**
+ * Deletes tunneling interface by UDP port.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifindex
+ *   Network interface index of VXLAN device.
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ */
+static void
+flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
+			    unsigned int ifindex,
+			    struct mlx5_flow *dev_flow)
+{
+	struct mlx5_flow_tcf_vtep *vtep, *vlst;
+
+	assert(dev_flow->tcf.tunnel);
+	pthread_mutex_lock(&vtep_list_mutex);
+	vtep = NULL;
+	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
+		if (vlst->ifindex == ifindex) {
+			vtep = vlst;
+			break;
+		}
+	}
+	if (!vtep) {
+		DRV_LOG(WARNING, "No VTEP device found in the list");
+		goto exit;
+	}
+	switch (dev_flow->tcf.tunnel->type) {
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
+		break;
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
+/*
+ * TODO: Remove the encap ancillary rules first.
+ * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
+ * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
+ */
+		break;
+	default:
+		assert(false);
+		DRV_LOG(WARNING, "Unsupported tunnel type");
+		break;
+	}
+	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
+	assert(vtep->refcnt);
+	if (!vtep->refcnt || !--vtep->refcnt) {
+		LIST_REMOVE(vtep, next);
+		flow_tcf_delete_iface(tcf, vtep);
+	}
+exit:
+	pthread_mutex_unlock(&vtep_list_mutex);
+}
+
 /**
  * Apply flow to E-Switch by sending Netlink message.
  *
@@ -3461,18 +3887,61 @@  struct pedit_parser {
 	       struct rte_flow_error *error)
 {
 	struct priv *priv = dev->data->dev_private;
-	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
+	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
 	struct mlx5_flow *dev_flow;
 	struct nlmsghdr *nlh;
+	int ret;
 
 	dev_flow = LIST_FIRST(&flow->dev_flows);
 	/* E-Switch flow can't be expanded. */
 	assert(!LIST_NEXT(dev_flow, next));
+	if (dev_flow->tcf.applied)
+		return 0;
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_NEWTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
-	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
+	if (dev_flow->tcf.tunnel) {
+		/*
+		 * Replace the interface index, target for
+		 * encapsulation, source for decapsulation.
+		 */
+		assert(!dev_flow->tcf.tunnel->ifindex_tun);
+		assert(dev_flow->tcf.tunnel->ifindex_ptr);
+		/* Create actual VTEP device when rule is being applied. */
+		dev_flow->tcf.tunnel->ifindex_tun
+			= flow_tcf_tunnel_vtep_create(tcf,
+					*dev_flow->tcf.tunnel->ifindex_ptr,
+					dev_flow, error);
+			DRV_LOG(INFO, "Replace ifindex: %d->%d",
+				dev_flow->tcf.tunnel->ifindex_tun,
+				*dev_flow->tcf.tunnel->ifindex_ptr);
+		if (!dev_flow->tcf.tunnel->ifindex_tun)
+			return -rte_errno;
+		dev_flow->tcf.tunnel->ifindex_org
+			= *dev_flow->tcf.tunnel->ifindex_ptr;
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_tun;
+	}
+	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+	if (dev_flow->tcf.tunnel) {
+		DRV_LOG(INFO, "Restore ifindex: %d->%d",
+				dev_flow->tcf.tunnel->ifindex_org,
+				*dev_flow->tcf.tunnel->ifindex_ptr);
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_org;
+		dev_flow->tcf.tunnel->ifindex_org = 0;
+	}
+	if (!ret) {
+		dev_flow->tcf.applied = 1;
 		return 0;
+	}
+	DRV_LOG(WARNING, "netlink: failed to create TC rule (%d)", rte_errno);
+	if (dev_flow->tcf.tunnel->ifindex_tun) {
+		flow_tcf_tunnel_vtep_delete(tcf,
+					    dev_flow->tcf.tunnel->ifindex_tun,
+					    dev_flow);
+		dev_flow->tcf.tunnel->ifindex_tun = 0;
+	}
 	return rte_flow_error_set(error, rte_errno,
 				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
 				  "netlink: failed to create TC flow rule");
@@ -3490,7 +3959,7 @@  struct pedit_parser {
 flow_tcf_remove(struct rte_eth_dev *dev, struct rte_flow *flow)
 {
 	struct priv *priv = dev->data->dev_private;
-	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
+	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
 	struct mlx5_flow *dev_flow;
 	struct nlmsghdr *nlh;
 
@@ -3501,10 +3970,36 @@  struct pedit_parser {
 		return;
 	/* E-Switch flow can't be expanded. */
 	assert(!LIST_NEXT(dev_flow, next));
+	if (!dev_flow->tcf.applied)
+		return;
+	if (dev_flow->tcf.tunnel) {
+		/*
+		 * Replace the interface index, target for
+		 * encapsulation, source for decapsulation.
+		 */
+		assert(dev_flow->tcf.tunnel->ifindex_tun);
+		assert(dev_flow->tcf.tunnel->ifindex_ptr);
+		dev_flow->tcf.tunnel->ifindex_org
+			= *dev_flow->tcf.tunnel->ifindex_ptr;
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_tun;
+	}
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_DELTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST;
-	flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL);
+	flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+	if (dev_flow->tcf.tunnel) {
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_org;
+		dev_flow->tcf.tunnel->ifindex_org = 0;
+		if (dev_flow->tcf.tunnel->ifindex_tun) {
+			flow_tcf_tunnel_vtep_delete(tcf,
+					dev_flow->tcf.tunnel->ifindex_tun,
+					dev_flow);
+			dev_flow->tcf.tunnel->ifindex_tun = 0;
+		}
+	}
+	dev_flow->tcf.applied = 0;
 }
 
 /**