[RFC,14/14] net/mlx5: support SubFunction

Message ID 20210527140202.19377-5-xuemingl@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series mlx5: support SubFunction

Checks

Context               Check    Description
ci/checkpatch         success  coding style OK
ci/Intel-compilation  warning  apply issues

Commit Message

Xueming Li May 27, 2021, 2:02 p.m. UTC
This patch introduces SubFunction (SF) support. Similar to a VF, an SF on the
auxiliary bus is a portion of the hardware PF. An SF takes no representor or
bonding parameters.

Devargs to support SF:
-a auxiliary:mlx5_core.sf.8,dv_flow_en=1

New global syntax to support SF:
-a bus=auxiliary,name=mlx5_core.sf.8/class=eth/driver=mlx5,dv_flow_en=1
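
As an illustration only, the two devargs forms above can be composed from the numeric suffix of the auxiliary device name. This is a minimal sketch: the helper names (sf_devargs_legacy, sf_devargs_global, sf_devargs_is_auxiliary) are hypothetical and are not part of DPDK or of this patch, and dv_flow_en=1 is simply copied from the examples above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helpers, not part of DPDK or of this patch. */

/* Format the legacy devargs form for auxiliary device "mlx5_core.sf.<idx>". */
static int
sf_devargs_legacy(char *buf, size_t len, unsigned int idx)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "auxiliary:mlx5_core.sf.%u,dv_flow_en=1", idx);
	/* snprintf() reports the untruncated length; reject truncation. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Format the global (bus/class/driver layered) devargs form. */
static int
sf_devargs_global(char *buf, size_t len, unsigned int idx)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len,
		     "bus=auxiliary,name=mlx5_core.sf.%u/class=eth/driver=mlx5,"
		     "dv_flow_en=1", idx);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Return 1 if the devargs string selects the auxiliary bus in either form. */
static int
sf_devargs_is_auxiliary(const char *devargs)
{
	return strncmp(devargs, "auxiliary:", strlen("auxiliary:")) == 0 ||
	       strncmp(devargs, "bus=auxiliary,", strlen("bus=auxiliary,")) == 0;
}
```

Calling sf_devargs_legacy(buf, sizeof(buf), 8) fills buf with the first devargs string shown above; sf_devargs_global() produces the second.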

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 doc/guides/nics/mlx5.rst                | 339 +++++++++++++++++++++++-
 drivers/net/mlx5/linux/mlx5_ethdev_os.c |  12 +-
 drivers/net/mlx5/linux/mlx5_os.c        | 142 +++++++---
 drivers/net/mlx5/linux/mlx5_os.h        |   2 +
 drivers/net/mlx5/mlx5.c                 |  10 +-
 drivers/net/mlx5/mlx5.h                 |   1 +
 drivers/net/mlx5/mlx5_rxmode.c          |   8 +-
 drivers/net/mlx5/mlx5_trigger.c         |   2 +-
 8 files changed, 452 insertions(+), 64 deletions(-)
  

Patch

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 83299646dd..3f5692038c 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -403,6 +403,300 @@  Limitations
   - Hairpin between two ports could only manual binding and explicit Tx flow mode. For single port hairpin, all the combinations of auto/manual binding and explicit/implicit Tx flow mode could be supported.
   - Hairpin in switchdev SR-IOV mode is not supported till now.
 
+- Meter:
+
+Limitations
+-----------
+
+- Windows support:
+
+  On Windows, the features are limited:
+
+  - Promiscuous mode is not supported
+  - The following rules are supported:
+
+    - IPv4/UDP with CVLAN filtering
+    - Unicast MAC filtering
+
+- For secondary process:
+
+  - Forked secondary process not supported.
+  - External memory unregistered in EAL memseg list cannot be used for DMA
+    unless such memory has been registered by ``mlx5_mr_update_ext_mp()`` in
+    primary process and remapped to the same virtual address in secondary
+    process. If the external memory is registered by primary process but has
+    different virtual address in secondary process, unexpected error may happen.
+
+- When using Verbs flow engine (``dv_flow_en`` = 0), flow pattern without any
+  specific VLAN will match for VLAN packets as well:
+
+  When VLAN spec is not specified in the pattern, the matching rule will be created with VLAN as a wild card.
+  Meaning, the flow rule::
+
+        flow create 0 ingress pattern eth / vlan vid is 3 / ipv4 / end ...
+
+  Will only match VLAN packets with vid=3, and the flow rule::
+
+        flow create 0 ingress pattern eth / ipv4 / end ...
+
+  Will match any ipv4 packet (VLAN included).
+
+- When using Verbs flow engine (``dv_flow_en`` = 0), multi-tagged (QinQ) match is not supported.
+
+- When using DV flow engine (``dv_flow_en`` = 1), flow pattern with any VLAN specification will match only single-tagged packets unless the ETH item ``type`` field is 0x88A8 or the VLAN item ``has_more_vlan`` field is 1.
+  The flow rule::
+
+        flow create 0 ingress pattern eth / ipv4 / end ...
+
+  Will match any ipv4 packet.
+  The flow rules::
+
+        flow create 0 ingress pattern eth / vlan / end ...
+        flow create 0 ingress pattern eth has_vlan is 1 / end ...
+        flow create 0 ingress pattern eth type is 0x8100 / end ...
+
+  Will match single-tagged packets only, with any VLAN ID value.
+  The flow rules::
+
+        flow create 0 ingress pattern eth type is 0x88A8 / end ...
+        flow create 0 ingress pattern eth / vlan has_more_vlan is 1 / end ...
+
+  Will match multi-tagged packets only, with any VLAN ID value.
+
+- A flow pattern with 2 sequential VLAN items is not supported.
+
+- VLAN pop offload command:
+
+  - Flow rules that have a VLAN pop offload command as one of their actions
+    but lack a match on VLAN as one of their items are not supported.
+  - The command is not supported on egress traffic in NIC mode.
+
+- VLAN push offload is not supported on ingress traffic in NIC mode.
+
+- VLAN set PCP offload is not supported on existing headers.
+
+- A multi-segment packet must not have more segments than reported by dev_infos_get()
+  in the tx_desc_lim.nb_seg_max field. This value depends on the maximal supported Tx
+  descriptor size and ``txq_inline_min`` settings and may range from 2 (worst case
+  forced by maximal inline settings) to 58.
+
+- Flows with a VXLAN Network Identifier equal (or ends to be equal)
+  to 0 are not supported.
+
+- L3 VXLAN and VXLAN-GPE tunnels cannot be supported together with MPLSoGRE and MPLSoUDP.
+
+- Match on Geneve header supports the following fields only:
+
+     - VNI
+     - OAM
+     - protocol type
+     - options length
+
+- Match on Geneve TLV option is supported on the following fields:
+
+     - Class
+     - Type
+     - Length
+     - Data
+
+  Only one Class/Type/Length Geneve TLV option is supported per shared device.
+  Class/Type/Length fields must be specified as well as masks.
+  Class/Type/Length specified masks must be full.
+  Matching Geneve TLV option without specifying data is not supported.
+  Matching Geneve TLV option with ``data & mask == 0`` is not supported.
+
+- VF: flow rules created on VF devices can only match traffic targeted at the
+  configured MAC addresses (see ``rte_eth_dev_mac_addr_add()``).
+
+- Match on GTP tunnel header item supports the following fields only:
+
+     - v_pt_rsv_flags: E flag, S flag, PN flag
+     - msg_type
+     - teid
+
+- Match on GTP extension header only for GTP PDU session container (next
+  extension header type = 0x85).
+- Match on GTP extension header is not supported in group 0.
+
+- No Tx metadata goes to the E-Switch steering domain for Flow group 0.
+  Flows within group 0 that use the set metadata action are rejected by hardware.
+
+.. note::
+
+   MAC addresses not already present in the bridge table of the associated
+   kernel network device will be added and cleaned up by the PMD when closing
+   the device. In case of ungraceful program termination, some entries may
+   remain present and should be removed manually by other means.
+
+- Buffer split offload is supported with regular Rx burst routine only,
+  no MPRQ feature or vectorized code can be engaged.
+
+- When Multi-Packet Rx queue is configured (``mprq_en``), an Rx packet can be
+  externally attached to a user-provided mbuf with EXT_ATTACHED_MBUF set in
+  ol_flags. As the mempool for the external buffer is managed by PMD, all the
+  Rx mbufs must be freed before the device is closed. Otherwise, the mempool of
+  the external buffers will be freed by PMD and the application which still
+  holds the external buffers may be corrupted.
+
+- If Multi-Packet Rx queue is configured (``mprq_en``) and Rx CQE compression is
+  enabled (``rxq_cqe_comp_en``) at the same time, RSS hash result is not fully
+  supported. Some Rx packets may not have PKT_RX_RSS_HASH.
+
+- IPv6 Multicast messages are not supported on VM, while promiscuous mode
+  and allmulticast mode are both set to off.
+  To receive IPv6 Multicast messages on VM, explicitly set the relevant
+  MAC address using rte_eth_dev_mac_addr_add() API.
+
+- To support a mixed traffic pattern (some buffers from local host memory, some
+  buffers from other devices) with high bandwidth, a mbuf flag is used.
+
+  An application hints the PMD whether or not it should try to inline the
+  given mbuf data buffer. The PMD makes a best effort to act upon this request.
+
+  The hint flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE`` is dynamic,
+  registered by application with rte_mbuf_dynflag_register(). This flag is
+  purely driver-specific and declared in PMD specific header ``rte_pmd_mlx5.h``,
+  which is intended to be used by the application.
+
+  To query the supported specific flags at runtime,
+  the function ``rte_pmd_mlx5_get_dyn_flag_names`` returns the array of
+  currently (over present hardware and configuration) supported specific flags.
+  The "not inline hint" feature operating flow is the following one:
+
+    - application starts
+    - probe the devices, ports are created
+    - query the port capabilities
+    - if port supporting the feature is found
+    - register dynamic flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE``
+    - application starts the ports
+    - on ``dev_start()`` PMD checks whether the feature flag is registered and
+      enables the feature support in datapath
+    - application might set the registered flag bit in the ``ol_flags`` field
+      of an mbuf being sent, and the PMD will handle such mbufs appropriately.
+
+- The number of descriptors in a Tx queue may be limited by data inline settings.
+  Inline data requires more descriptor building blocks, and the overall block
+  count may exceed the hardware-supported limits. The application should
+  reduce the requested Tx size or adjust data inline settings with
+  ``txq_inline_max`` and ``txq_inline_mpw`` devargs keys.
+
+- To provide the packet send scheduling on mbuf timestamps the ``tx_pp``
+  parameter should be specified.
+  When the PMD sees RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME set on the packet
+  being sent, it tries to synchronize the time the packet appears on
+  the wire with the specified packet timestamp. If the specified timestamp
+  is in the past, it should be ignored; if it is in the distant future,
+  it should be capped at some reasonable value (in the range of seconds).
+  These specific cases ("too late" and "distant future") can be optionally
+  reported via device xstats to assist applications to detect the
+  time-related problems.
+
+  The timestamp upper "too-distant-future" limit
+  at the moment of invoking the Tx burst routine
+  can be estimated as ``tx_pp`` option (in nanoseconds) multiplied by 2^23.
+  Please note, for the testpmd txonly mode,
+  the limit is deduced from the expression::
+
+        (n_tx_descriptors / burst_size + 1) * inter_burst_gap
+
+  No packet reordering according to timestamps is performed,
+  neither within a packet burst nor between packets; it is entirely
+  the application's responsibility to generate packets and their timestamps
+  in the desired order. The timestamp can be put only in the first packet
+  of the burst, providing scheduling for the entire burst.
+
+- E-Switch decapsulation Flow:
+
+  - can be applied to PF port only.
+  - must specify VF port action (packet redirection from PF to VF).
+  - optionally may specify tunnel inner source and destination MAC addresses.
+
+- E-Switch encapsulation Flow:
+
+  - can be applied to VF ports only.
+  - must specify PF port action (packet redirection from VF to PF).
+
+- Raw encapsulation:
+
+  - The input buffer, used as outer header, is not validated.
+
+- Raw decapsulation:
+
+  - The decapsulation is always done up to the outermost tunnel detected by the HW.
+  - The input buffer, providing the removal size, is not validated.
+  - The buffer size must match the length of the headers to be removed.
+
+- ICMP(code/type/identifier/sequence number) / ICMP6(code/type) matching, IP-in-IP and MPLS flow matching are all
+  mutually exclusive features which cannot be supported together
+  (see :ref:`mlx5_firmware_config`).
+
+- LRO:
+
+  - Requires DevX and DV flow to be enabled.
+  - KEEP_CRC offload cannot be supported with LRO.
+  - The first mbuf length, without head-room, must be big enough to include the
+    TCP header (122B).
+  - Rx queue with LRO offload enabled, receiving a non-LRO packet, can forward
+    it with size limited to max LRO size, not to max RX packet length.
+  - LRO can be used with outer header of TCP packets of the standard format:
+        eth (with or without vlan) / ipv4 or ipv6 / tcp / payload
+
+    Other TCP packets (e.g. with MPLS label) received on Rx queue with LRO enabled, will be received with bad checksum.
+  - LRO packet aggregation is performed by HW only for packet size larger than
+    ``lro_min_mss_size``. This value is reported on device start, when debug
+    mode is enabled.
+
+- CRC:
+
+  - ``DEV_RX_OFFLOAD_KEEP_CRC`` cannot be supported with decapsulation
+    for some NICs (such as ConnectX-6 Dx, ConnectX-6 Lx, and BlueField-2).
+    The capability bit ``scatter_fcs_w_decap_disable`` shows NIC support.
+
+- TX mbuf fast free:
+
+  - fast free offload assumes that all mbufs being sent originate from the
+    same memory pool and there are no extra references to the mbufs (the
+    reference counter for each mbuf is equal to 1 on tx_burst call). The latter
+    means there should be no externally attached buffers in mbufs. It is
+    the application's responsibility to provide the correct mbufs if the fast
+    free offload is engaged. The mlx5 PMD implicitly produces the mbufs with
+    externally attached buffers if MPRQ option is enabled, hence, the fast
+    free offload is neither supported nor advertised if there is MPRQ enabled.
+
+- Sample flow:
+
+  - Supports ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action only within NIC Rx and
+    E-Switch steering domain.
+  - For E-Switch Sampling flow with sample ratio > 1, additional actions are not
+    supported in the sample actions list.
+  - For ConnectX-5, the ``RTE_FLOW_ACTION_TYPE_SAMPLE`` is typically used as
+    the first action in the E-Switch egress flow when used with header modify or
+    encapsulation actions.
+  - For NIC Rx flow, supports ``MARK``, ``COUNT``, ``QUEUE``, ``RSS`` in the
+    sample actions list.
+  - For E-Switch mirroring flow, supports ``RAW ENCAP``, ``Port ID``,
+    ``VXLAN ENCAP``, ``NVGRE ENCAP`` in the sample actions list.
+
+- Modify Field flow:
+
+  - Supports the 'set' operation only for ``RTE_FLOW_ACTION_TYPE_MODIFY_FIELD`` action.
+  - Modification of an arbitrary place in a packet via the special ``RTE_FLOW_FIELD_START`` Field ID is not supported.
+  - Modification of the 802.1Q Tag, VXLAN Network or GENEVE Network IDs is not supported.
+  - Encapsulation levels are not supported, can modify outermost header fields only.
+  - Offsets must be 32-bit aligned and cannot skip past the boundary of a field.
+
+- IPv6 header item 'proto' field, indicating the next header protocol, should
+  not be set as extension header.
+  In case the next header is an extension header, it should not be specified in
+  IPv6 header item 'proto' field.
+  The last extension header item 'next header' field can specify the following
+  header protocol type.
+
+- Hairpin:
+
+  - Hairpin between two ports could only manual binding and explicit Tx flow mode. For single port hairpin, all the combinations of auto/manual binding and explicit/implicit Tx flow mode could be supported.
+  - Hairpin in switchdev SR-IOV mode is not supported till now.
+
 - Meter:
 
   - All the meter colors with drop action will be counted only by the global drop statistics.
@@ -1438,13 +1732,17 @@  the DPDK application.
 
         echo switchdev > /sys/class/net/<net device>/compat/devlink/mode
 
-Sub-Function representor
-------------------------
+SubFunction support
+-------------------
+SubFunction is a portion of the PCI device. An SF netdev has its own
+dedicated queues (txq, rxq). An SF shares PCI-level resources with other SFs
+and/or with its parent PCI function.
 
-Sub-Function is a portion of the PCI device, a SF netdev has its own
-dedicated queues(txq, rxq). A SF netdev supports E-Switch representation
-offload similar to existing PF and VF representors. A SF shares PCI
-level resources with other SFs and/or with its parent PCI function.
+0. Requirements::
+
+        kernel version >= 5.12 or OFED version >= 5.6
+
+        iproute2 >= 5.11
 
 1. Configure SF feature::
 
@@ -1457,21 +1755,34 @@  level resources with other SFs and/or with its parent PCI function.
             2: 32 SFs
             3: 64 SFs
 
-2. Reset the FW::
+2. Enable switchdev mode::
 
-        mlxfwreset -d <mst device> reset
+        devlink dev eswitch set pci/<DBDF> mode switchdev
 
-3. Enable switchdev mode::
+3. Add SF port::
 
-        echo switchdev > /sys/class/net/<net device>/compat/devlink/mode
+        devlink port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>
+
+        Get SFID from output: pci/<DBDF>/<SFID>
+
+4. Modify MAC address::
+
+        devlink port function set pci/<DBDF>/<SFID> hw_addr <MAC>
+
+5. Activate SF port::
+
+        devlink port function set pci/<DBDF>/<SFID> state active
 
-4. Create SF::
+6. Devargs to probe SF device::
 
-        mlnx-sf -d <PCI_BDF> -a create
+        auxiliary:mlx5_core.sf.9,dv_flow_en=1
 
-5. Probe SF representor::
+SubFunction representor support
+-------------------------------
+A SF netdev supports E-Switch representation offload similar to existing PF
+and VF representors. Use <sfnum> to probe SF representor::
 
-        testpmd> port attach <PCI_BDF>,representor=sf0,dv_flow_en=1
+        testpmd> port attach <PCI_BDF>,representor=sf<sfnum>,dv_flow_en=1
 
 Performance tuning
 ------------------
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 6fdb310129..8678502595 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -128,6 +128,17 @@  struct ethtool_link_settings {
 #define ETHTOOL_LINK_MODE_200000baseCR4_Full_BIT 2 /* 66 - 64 */
 #endif
 
+/* Get interface index from SubFunction device name. */
+int
+mlx5_auxiliary_get_ifindex(const char *sf_name)
+{
+	char if_name[IF_NAMESIZE];
+
+	if (mlx5_auxiliary_get_child_name(sf_name, "/net",
+					  if_name, sizeof(if_name)) != 0)
+		return -rte_errno;
+	return if_nametoindex(if_name);
+}
 
 /**
  * Get interface name from private structure.
@@ -1619,4 +1630,3 @@  mlx5_get_mac(struct rte_eth_dev *dev, uint8_t (*mac)[RTE_ETHER_ADDR_LEN])
 	memcpy(mac, request.ifr_hwaddr.sa_data, RTE_ETHER_ADDR_LEN);
 	return 0;
 }
-
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 4f16230fa5..d74273a7ca 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -20,6 +20,7 @@ 
 #include <ethdev_pci.h>
 #include <rte_pci.h>
 #include <rte_bus_pci.h>
+#include <rte_bus_auxiliary.h>
 #include <rte_common.h>
 #include <rte_kvargs.h>
 #include <rte_rwlock.h>
@@ -1923,6 +1924,27 @@  mlx5_device_bond_pci_match(const struct ibv_device *ibv_dev,
 	return pf;
 }
 
+static void
+mlx5_os_config_default(struct mlx5_dev_config *config)
+{
+	memset(config, 0, sizeof(*config));
+	config->mps = MLX5_ARG_UNSET;
+	config->dbnc = MLX5_ARG_UNSET;
+	config->rx_vec_en = 1;
+	config->txq_inline_max = MLX5_ARG_UNSET;
+	config->txq_inline_min = MLX5_ARG_UNSET;
+	config->txq_inline_mpw = MLX5_ARG_UNSET;
+	config->txqs_inline = MLX5_ARG_UNSET;
+	config->vf_nl_en = 1;
+	config->mr_ext_memseg_en = 1;
+	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
+	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
+	config->dv_esw_en = 1;
+	config->dv_flow_en = 1;
+	config->decap_en = 1;
+	config->log_hp_size = MLX5_ARG_UNSET;
+}
+
 /**
  * Register a PCI device within bonding.
  *
@@ -2334,23 +2356,8 @@  mlx5_os_pci_probe_pf(struct rte_pci_device *pci_dev,
 		uint32_t restore;
 
 		/* Default configuration. */
-		memset(&dev_config, 0, sizeof(struct mlx5_dev_config));
+		mlx5_os_config_default(&dev_config);
 		dev_config.vf = dev_config_vf;
-		dev_config.mps = MLX5_ARG_UNSET;
-		dev_config.dbnc = MLX5_ARG_UNSET;
-		dev_config.rx_vec_en = 1;
-		dev_config.txq_inline_max = MLX5_ARG_UNSET;
-		dev_config.txq_inline_min = MLX5_ARG_UNSET;
-		dev_config.txq_inline_mpw = MLX5_ARG_UNSET;
-		dev_config.txqs_inline = MLX5_ARG_UNSET;
-		dev_config.vf_nl_en = 1;
-		dev_config.mr_ext_memseg_en = 1;
-		dev_config.mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
-		dev_config.mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
-		dev_config.dv_esw_en = 1;
-		dev_config.dv_flow_en = 1;
-		dev_config.decap_en = 1;
-		dev_config.log_hp_size = MLX5_ARG_UNSET;
 		list[i].eth_dev = mlx5_dev_spawn(&pci_dev->device,
 						 &list[i],
 						 &dev_config,
@@ -2407,6 +2414,35 @@  mlx5_os_pci_probe_pf(struct rte_pci_device *pci_dev,
 	return ret;
 }
 
+static int
+mlx5_os_parse_eth_devargs(struct rte_device *dev,
+			  struct rte_eth_devargs *eth_da)
+{
+	int ret = 0;
+
+	if (dev->devargs == NULL)
+		return 0;
+	memset(eth_da, 0, sizeof(*eth_da));
+	/* Parse representor information first from class argument. */
+	if (dev->devargs->cls_str)
+		ret = rte_eth_devargs_parse(dev->devargs->cls_str, eth_da);
+	if (ret != 0) {
+		DRV_LOG(ERR, "failed to parse device arguments: %s",
+			dev->devargs->cls_str);
+		return -rte_errno;
+	}
+	if (eth_da->type == RTE_ETH_REPRESENTOR_NONE) {
+		/* Parse legacy device argument */
+		ret = rte_eth_devargs_parse(dev->devargs->args, eth_da);
+		if (ret) {
+			DRV_LOG(ERR, "failed to parse device arguments: %s",
+				dev->devargs->args);
+			return -rte_errno;
+		}
+	}
+	return 0;
+}
+
 /**
  * Callback to register a PCI device.
  *
@@ -2421,31 +2457,13 @@  mlx5_os_pci_probe_pf(struct rte_pci_device *pci_dev,
 static int
 mlx5_os_pci_probe(struct rte_pci_device *pci_dev)
 {
-	struct rte_eth_devargs eth_da = { .type = RTE_ETH_REPRESENTOR_NONE };
+	struct rte_eth_devargs eth_da = { .nb_ports = 0 };
 	int ret = 0;
 	uint16_t p;
 
-	if (pci_dev->device.devargs) {
-		/* Parse representor information from device argument. */
-		if (pci_dev->device.devargs->cls_str)
-			ret = rte_eth_devargs_parse
-				(pci_dev->device.devargs->cls_str, &eth_da);
-		if (ret) {
-			DRV_LOG(ERR, "failed to parse device arguments: %s",
-				pci_dev->device.devargs->cls_str);
-			return -rte_errno;
-		}
-		if (eth_da.type == RTE_ETH_REPRESENTOR_NONE) {
-			/* Support legacy device argument */
-			ret = rte_eth_devargs_parse
-				(pci_dev->device.devargs->args, &eth_da);
-			if (ret) {
-				DRV_LOG(ERR, "failed to parse device arguments: %s",
-					pci_dev->device.devargs->args);
-				return -rte_errno;
-			}
-		}
-	}
+	ret = mlx5_os_parse_eth_devargs(&pci_dev->device, &eth_da);
+	if (ret != 0)
+		return ret;
 
 	if (eth_da.nb_ports > 0) {
 		/* Iterate all port if devargs pf is range: "pf[0-1]vf[...]". */
@@ -2458,10 +2476,53 @@  mlx5_os_pci_probe(struct rte_pci_device *pci_dev)
 	return ret;
 }
 
+/* Probe a single SF device on auxiliary bus, no representor support. */
+static int
+mlx5_os_auxiliary_probe(struct rte_device *dev)
+{
+	struct rte_eth_devargs eth_da = { .nb_ports = 0 };
+	struct mlx5_dev_config config;
+	struct mlx5_dev_spawn_data spawn = { .pf_bond = -1 };
+	struct rte_auxiliary_device *adev = RTE_DEV_TO_AUXILIARY(dev);
+	struct rte_eth_dev *eth_dev;
+	int ret = 0;
+
+	/* Parse ethdev devargs. */
+	ret = mlx5_os_parse_eth_devargs(dev, &eth_da);
+	if (ret != 0)
+		return ret;
+	/* Set default config data. */
+	mlx5_os_config_default(&config);
+	config.sf = 1;
+	/* Init spawn data. */
+	spawn.max_port = 1;
+	spawn.phys_port = 1;
+	spawn.phys_dev = mlx5_get_ibv_device(dev);
+	ret = mlx5_auxiliary_get_ifindex(dev->name);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get ethdev ifindex: %s", dev->name);
+		return ret;
+	}
+	spawn.ifindex = ret;
+	/* Spawn device. */
+	eth_dev = mlx5_dev_spawn(dev, &spawn, &config, &eth_da);
+	if (eth_dev == NULL)
+		return -rte_errno;
+	/* Post create. */
+	eth_dev->intr_handle = &adev->intr_handle;
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+		eth_dev->data->dev_flags |= RTE_ETH_DEV_INTR_LSC;
+		eth_dev->data->dev_flags |= RTE_ETH_DEV_INTR_RMV;
+		eth_dev->data->numa_node = dev->numa_node;
+	}
+	rte_eth_dev_probing_finish(eth_dev);
+	return 0;
+}
+
 /**
  * Common bus driver callback to probe a device.
  *
- * This function probe PCI bus device(s).
+ * This function probes PCI bus device(s) or a single SF on auxiliary bus.
  *
  * @param[in] dev
  *   Pointer to the generic device.
@@ -2484,7 +2545,8 @@  mlx5_os_net_probe(struct rte_device *dev)
 	}
 	if (mlx5_dev_is_pci(dev))
 		return mlx5_os_pci_probe(RTE_DEV_TO_PCI(dev));
-	return 0;
+	else
+		return mlx5_os_auxiliary_probe(dev);
 }
 
 static int
diff --git a/drivers/net/mlx5/linux/mlx5_os.h b/drivers/net/mlx5/linux/mlx5_os.h
index af7cbeb418..2991d37df2 100644
--- a/drivers/net/mlx5/linux/mlx5_os.h
+++ b/drivers/net/mlx5/linux/mlx5_os.h
@@ -19,4 +19,6 @@  enum {
 
 #define MLX5_NAMESIZE IF_NAMESIZE
 
+int mlx5_auxiliary_get_ifindex(const char *sf_name);
+
 #endif /* RTE_PMD_MLX5_OS_H_ */
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 3defdb2db3..69edd55b86 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2319,10 +2319,12 @@  mlx5_eth_find_next(uint16_t port_id, struct rte_eth_dev *odev)
 			if (opriv->sh == priv->sh ||
 			    odev->device == dev->device)
 				break;
-		} else if (dev->device != NULL && dev->device->driver &&
-			dev->device->driver->name &&
-			!strcmp(dev->device->driver->name,
-				MLX5_PCI_DRIVER_NAME)) {
+		} else if (dev->device != NULL && dev->device->driver != NULL &&
+			dev->device->driver->name != NULL &&
+			(strcmp(dev->device->driver->name,
+				MLX5_PCI_DRIVER_NAME) == 0 ||
+			 strcmp(dev->device->driver->name,
+				MLX5_AUXILIARY_DRIVER_NAME) == 0)) {
 			/* odev not specified, found all mlx5 devices. */
 			break;
 		}
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 27bb34e827..b06f45fc54 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -220,6 +220,7 @@  struct mlx5_dev_config {
 	unsigned int hw_fcs_strip:1; /* FCS stripping is supported. */
 	unsigned int hw_padding:1; /* End alignment padding is supported. */
 	unsigned int vf:1; /* This is a VF. */
+	unsigned int sf:1; /* This is a SF. */
 	unsigned int tunnel_en:1;
 	/* Whether tunnel stateless offloads are supported. */
 	unsigned int mpls_en:1; /* MPLS over GRE/UDP is enabled. */
diff --git a/drivers/net/mlx5/mlx5_rxmode.c b/drivers/net/mlx5/mlx5_rxmode.c
index 25fb47c9ed..7f19b235c2 100644
--- a/drivers/net/mlx5/mlx5_rxmode.c
+++ b/drivers/net/mlx5/mlx5_rxmode.c
@@ -36,7 +36,7 @@  mlx5_promiscuous_enable(struct rte_eth_dev *dev)
 			dev->data->port_id);
 		return 0;
 	}
-	if (priv->config.vf) {
+	if (priv->config.vf || priv->config.sf) {
 		ret = mlx5_os_set_promisc(dev, 1);
 		if (ret)
 			return ret;
@@ -69,7 +69,7 @@  mlx5_promiscuous_disable(struct rte_eth_dev *dev)
 	int ret;
 
 	dev->data->promiscuous = 0;
-	if (priv->config.vf) {
+	if (priv->config.vf || priv->config.sf) {
 		ret = mlx5_os_set_promisc(dev, 0);
 		if (ret)
 			return ret;
@@ -109,7 +109,7 @@  mlx5_allmulticast_enable(struct rte_eth_dev *dev)
 			dev->data->port_id);
 		return 0;
 	}
-	if (priv->config.vf) {
+	if (priv->config.vf || priv->config.sf) {
 		ret = mlx5_os_set_allmulti(dev, 1);
 		if (ret)
 			goto error;
@@ -142,7 +142,7 @@  mlx5_allmulticast_disable(struct rte_eth_dev *dev)
 	int ret;
 
 	dev->data->all_multicast = 0;
-	if (priv->config.vf) {
+	if (priv->config.vf || priv->config.sf) {
 		ret = mlx5_os_set_allmulti(dev, 0);
 		if (ret)
 			goto error;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 6c8a64ce03..e4e057a6f8 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -1259,7 +1259,7 @@  mlx5_traffic_enable(struct rte_eth_dev *dev)
 		}
 		mlx5_txq_release(dev, i);
 	}
-	if (priv->config.dv_esw_en && !priv->config.vf) {
+	if (priv->config.dv_esw_en && !priv->config.vf && !priv->config.sf) {
 		if (mlx5_flow_create_esw_table_zero_flow(dev))
 			priv->fdb_def_rule = 1;
 		else
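
For reference, the interface index lookup that the auxiliary probe path relies on can be sketched outside DPDK as follows. This is a minimal sketch under stated assumptions, not the driver's implementation: sf_ifindex_from_sysfs is a hypothetical name, and the sysfs layout (a "net" directory under the device directory, e.g. /sys/bus/auxiliary/devices/mlx5_core.sf.8) is assumed rather than taken from the patch. Note that if_nametoindex() returns 0 on failure, never a negative value, so the sketch converts that case into a negative errno; a caller that tests only for a negative result would otherwise not see the failure.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <net/if.h>
#include <stdio.h>

/*
 * Sketch only: return the ifindex of the first netdev listed under
 * "<dev_dir>/net", or a negative errno value on failure.
 * dev_dir is assumed to be the device's sysfs directory, e.g.
 * "/sys/bus/auxiliary/devices/mlx5_core.sf.8" (layout is an assumption).
 */
static int
sf_ifindex_from_sysfs(const char *dev_dir)
{
	char path[512];
	char if_name[IF_NAMESIZE] = "";
	struct dirent *ent;
	unsigned int idx;
	DIR *dir;
	int n;

	assert(dev_dir != NULL);
	n = snprintf(path, sizeof(path), "%s/net", dev_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;
	dir = opendir(path);
	if (dir == NULL)
		return -errno;
	while ((ent = readdir(dir)) != NULL) {
		if (ent->d_name[0] == '.')
			continue; /* Skip "." and "..". */
		n = snprintf(if_name, sizeof(if_name), "%s", ent->d_name);
		break;
	}
	closedir(dir);
	if (if_name[0] == '\0')
		return -ENOENT; /* No netdev entry found. */
	if (n < 0 || (size_t)n >= sizeof(if_name))
		return -ENAMETOOLONG;
	/* if_nametoindex() returns 0 on failure, never a negative value. */
	errno = 0;
	idx = if_nametoindex(if_name);
	if (idx == 0)
		return errno != 0 ? -errno : -ENODEV;
	return (int)idx;
}
```

With this shape, a probe routine can keep a single "ret < 0" check and still catch a netdev that disappeared between the sysfs read and the index lookup.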