From patchwork Tue Apr 2 15:46:53 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Xiaolong Ye
X-Patchwork-Id: 52121
X-Patchwork-Delegate: ferruh.yigit@amd.com
From: Xiaolong Ye
To: dev@dpdk.org, Stephen Hemminger, Ferruh Yigit
Cc: Qi Zhang, Karlsson Magnus, Topel Bjorn, Maxime Coquelin, Luca Boccassi, Bruce Richardson, Ananyev Konstantin, David Marchand, Andrew Rybchenko, Olivier Matz, Xiaolong Ye
Date: Tue, 2 Apr 2019 23:46:53 +0800
Message-Id: <20190402154653.711-2-xiaolong.ye@intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190402154653.711-1-xiaolong.ye@intel.com>
References: <20190301080947.91086-1-xiaolong.ye@intel.com> <20190402154653.711-1-xiaolong.ye@intel.com>
Subject: [dpdk-dev] [PATCH v9 1/1] net/af_xdp: introduce AF XDP PMD driver

Add a new PMD for AF_XDP, which is a proposed faster version of the
AF_PACKET interface in Linux.
For more information about AF_XDP, please refer to [1] [2].

This is the vanilla version of the PMD, which just uses a raw buffer
registered as the umem.

[1] https://fosdem.org/2018/schedule/event/af_xdp/
[2] https://lwn.net/Articles/745934/

Signed-off-by: Xiaolong Ye
---
 MAINTAINERS | 7 +
 config/common_base | 5 +
 doc/guides/nics/af_xdp.rst | 48 +
 doc/guides/nics/features/af_xdp.ini | 11 +
 doc/guides/nics/index.rst | 1 +
 doc/guides/rel_notes/release_19_05.rst | 7 +
 drivers/net/Makefile | 1 +
 drivers/net/af_xdp/Makefile | 32 +
 drivers/net/af_xdp/meson.build | 21 +
 drivers/net/af_xdp/rte_eth_af_xdp.c | 956 ++++++++++++++++++
 drivers/net/af_xdp/rte_pmd_af_xdp_version.map | 3 +
 drivers/net/meson.build | 1 +
 mk/rte.app.mk | 1 +
 13 files changed, 1094 insertions(+)
 create mode 100644 doc/guides/nics/af_xdp.rst
 create mode 100644 doc/guides/nics/features/af_xdp.ini
 create mode 100644 drivers/net/af_xdp/Makefile
 create mode 100644 drivers/net/af_xdp/meson.build
 create mode 100644 drivers/net/af_xdp/rte_eth_af_xdp.c
 create mode 100644 drivers/net/af_xdp/rte_pmd_af_xdp_version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index e9ff2b4c2..c13ae8215 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -479,6 +479,13 @@ M: John W. Linville
 F: drivers/net/af_packet/
 F: doc/guides/nics/features/afpacket.ini
 
+Linux AF_XDP
+M: Xiaolong Ye
+M: Qi Zhang
+F: drivers/net/af_xdp/
+F: doc/guides/nics/af_xdp.rst
+F: doc/guides/nics/features/af_xdp.ini
+
 Amazon ENA
 M: Marcin Wojtas
 M: Michal Krawczyk
diff --git a/config/common_base b/config/common_base
index 6292bc4af..b95ee03d7 100644
--- a/config/common_base
+++ b/config/common_base
@@ -430,6 +430,11 @@ CONFIG_RTE_LIBRTE_VMXNET3_DEBUG_TX_FREE=n
 #
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=n
 
+#
+# Compile software PMD backed by AF_XDP sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_AF_XDP=n
+
 #
 # Compile link bonding PMD library
 #
diff --git a/doc/guides/nics/af_xdp.rst b/doc/guides/nics/af_xdp.rst
new file mode 100644
index 000000000..af675d910
--- /dev/null
+++ b/doc/guides/nics/af_xdp.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2019 Intel Corporation.
+
+AF_XDP Poll Mode Driver
+==========================
+
+AF_XDP is an address family that is optimized for high performance
+packet processing. AF_XDP sockets allow an XDP program to redirect
+packets to a memory buffer in userspace.
+
+For full details on AF_XDP sockets, refer to the
+`AF_XDP documentation in the Kernel
+`_.
+
+This Linux-specific PMD creates an AF_XDP socket and binds it to a specific
+netdev queue. It allows a DPDK application to send and receive raw packets
+through the socket, bypassing the kernel network stack.
+The current implementation supports only a single queue; multi-queue support
+will be added later.
+
+Note that the MTU of the AF_XDP PMD is limited because XDP lacks support
+for fragmentation.
+
+Options
+-------
+
+The following options can be provided to set up an af_xdp port in DPDK.
+
+* ``iface`` - name of the Kernel interface to attach to (required);
+* ``queue`` - netdev queue id (optional, default 0);
+
+Prerequisites
+-------------
+
+This is a Linux-specific PMD, thus the following prerequisites apply:
+
+* A Linux Kernel (version > 4.18) with XDP sockets configuration enabled;
+* libbpf (from kernel version >= v5.1-rc1) with the latest AF_XDP support installed;
+* A Kernel bound interface to attach to.
+
+Set up an af_xdp interface
+-----------------------------
+
+The following example will set up an af_xdp interface in DPDK:
+
+.. code-block:: console
+
+    --vdev net_af_xdp,iface=ens786f1,queue=0
diff --git a/doc/guides/nics/features/af_xdp.ini b/doc/guides/nics/features/af_xdp.ini
new file mode 100644
index 000000000..36953c2de
--- /dev/null
+++ b/doc/guides/nics/features/af_xdp.ini
@@ -0,0 +1,11 @@
+;
+; Supported features of the 'af_xdp' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Link status = Y
+MTU update = Y
+Promiscuous mode = Y
+Stats per queue = Y
+x86-64 = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 5c80e3baa..a4b80a3d0 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -12,6 +12,7 @@ Network Interface Controller Drivers
     features
     build_and_test
     af_packet
+    af_xdp
     ark
     atlantic
     avp
diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst
index bdad1ddbe..79e36739f 100644
--- a/doc/guides/rel_notes/release_19_05.rst
+++ b/doc/guides/rel_notes/release_19_05.rst
@@ -74,6 +74,13 @@ New Features
     process.
   * Added support for Rx packet types list in a secondary process.
 
+* **Added the AF_XDP PMD.**
+
+  Added a Linux-specific PMD for AF_XDP. It creates an AF_XDP socket and binds
+  it to a specific netdev queue, allowing a DPDK application to send and
+  receive raw packets through the socket while bypassing the kernel network
+  stack to achieve high performance packet processing.
+
 * **Updated Mellanox drivers.**
 
   New features and improvements were done in mlx4 and mlx5 PMDs:
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 502869a87..5d401b8c5 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -9,6 +9,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD),d)
 endif
 
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET) += af_packet
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += af_xdp
 DIRS-$(CONFIG_RTE_LIBRTE_ARK_PMD) += ark
 DIRS-$(CONFIG_RTE_LIBRTE_ATLANTIC_PMD) += atlantic
 DIRS-$(CONFIG_RTE_LIBRTE_AVP_PMD) += avp
diff --git a/drivers/net/af_xdp/Makefile b/drivers/net/af_xdp/Makefile
new file mode 100644
index 000000000..8343e3016
--- /dev/null
+++ b/drivers/net/af_xdp/Makefile
@@ -0,0 +1,32 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_pmd_af_xdp.a
+
+EXPORT_MAP := rte_pmd_af_xdp_version.map
+
+LIBABIVER := 1
+
+CFLAGS += -O3
+
+# require kernel version >= v5.1-rc1
+CFLAGS += -I$(RTE_KERNELDIR)/tools/include
+CFLAGS += -I$(RTE_KERNELDIR)/tools/lib/bpf
+
+CFLAGS += $(WERROR_FLAGS)
+LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
+LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs
+LDLIBS += -lrte_bus_vdev
+LDLIBS += -lbpf
+
+#
+# all sources are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += rte_eth_af_xdp.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/af_xdp/meson.build b/drivers/net/af_xdp/meson.build
new file mode 100644
index 000000000..d40aae190
--- /dev/null
+++ b/drivers/net/af_xdp/meson.build
@@ -0,0 +1,21 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+if host_machine.system() != 'linux'
+	build = false
+endif
+
+bpf_dep = dependency('libbpf', required: false)
+if bpf_dep.found()
+	build = true
+else
+	bpf_dep = cc.find_library('bpf', required: false)
+	if bpf_dep.found() and cc.has_header('xsk.h', dependencies: bpf_dep) and cc.has_header('linux/if_xdp.h')
+		build = true
+		pkgconfig_extra_libs += '-lbpf'
+	else
+		build = false
+	endif
+endif
+sources = files('rte_eth_af_xdp.c')
+ext_deps += bpf_dep
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
new file mode 100644
index 000000000..628b160a2
--- /dev/null
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -0,0 +1,956 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+static int af_xdp_logtype;
+
+#define AF_XDP_LOG(level, fmt, args...)			\
+	rte_log(RTE_LOG_ ## level, af_xdp_logtype,	\
+		"%s(): " fmt, __func__, ##args)
+
+#define ETH_AF_XDP_FRAME_SIZE		XSK_UMEM__DEFAULT_FRAME_SIZE
+#define ETH_AF_XDP_NUM_BUFFERS		4096
+#define ETH_AF_XDP_DATA_HEADROOM	0
+#define ETH_AF_XDP_DFLT_NUM_DESCS	XSK_RING_CONS__DEFAULT_NUM_DESCS
+#define ETH_AF_XDP_DFLT_QUEUE_IDX	0
+
+#define ETH_AF_XDP_RX_BATCH_SIZE	32
+#define ETH_AF_XDP_TX_BATCH_SIZE	32
+
+#define ETH_AF_XDP_MAX_QUEUE_PAIRS	16
+
+struct xsk_umem_info {
+	struct xsk_ring_prod fq;
+	struct xsk_ring_cons cq;
+	struct xsk_umem *umem;
+	struct rte_ring *buf_ring;
+	const struct rte_memzone *mz;
+};
+
+struct rx_stats {
+	uint64_t rx_pkts;
+	uint64_t rx_bytes;
+	uint64_t rx_dropped;
+};
+
+struct pkt_rx_queue {
+	struct xsk_ring_cons rx;
+	struct xsk_umem_info *umem;
+	struct xsk_socket *xsk;
+	struct rte_mempool *mb_pool;
+
+	struct rx_stats stats;
+
+	struct pkt_tx_queue *pair;
+	uint16_t queue_idx;
+};
+
+struct tx_stats {
+	uint64_t tx_pkts;
+	uint64_t err_pkts;
+	uint64_t tx_bytes;
+};
+
+struct pkt_tx_queue {
+	struct xsk_ring_prod tx;
+
+	struct tx_stats stats;
+
+	struct pkt_rx_queue *pair;
+	uint16_t queue_idx;
+};
+
+struct pmd_internals {
+	int if_index;
+	char if_name[IFNAMSIZ];
+	uint16_t queue_idx;
+	struct ether_addr eth_addr;
+	struct xsk_umem_info *umem;
+	struct rte_mempool *mb_pool_share;
+
+	struct pkt_rx_queue rx_queues[ETH_AF_XDP_MAX_QUEUE_PAIRS];
+	struct pkt_tx_queue tx_queues[ETH_AF_XDP_MAX_QUEUE_PAIRS];
+};
+
+#define ETH_AF_XDP_IFACE_ARG		"iface"
+#define ETH_AF_XDP_QUEUE_IDX_ARG	"queue"
+
+static const char * const valid_arguments[] = {
+	ETH_AF_XDP_IFACE_ARG,
+	ETH_AF_XDP_QUEUE_IDX_ARG,
+	NULL
+};
+
+static const struct rte_eth_link pmd_link = {
+	.link_speed = ETH_SPEED_NUM_10G,
+	.link_duplex = ETH_LINK_FULL_DUPLEX,
+	.link_status = ETH_LINK_DOWN,
+	.link_autoneg = ETH_LINK_AUTONEG
+};
+
+static inline int
+reserve_fill_queue(struct xsk_umem_info *umem, int reserve_size)
+{
+	struct xsk_ring_prod *fq = &umem->fq;
+	uint32_t idx;
+	int i, ret;
+
+	ret = xsk_ring_prod__reserve(fq, reserve_size, &idx);
+	if (unlikely(!ret)) {
+		AF_XDP_LOG(ERR, "Failed to reserve enough fq descs.\n");
+		return -1;
+	}
+
+	for (i = 0; i < reserve_size; i++) {
+		__u64 *fq_addr;
+		void *addr = NULL;
+		if (rte_ring_dequeue(umem->buf_ring, &addr)) {
+			/* only entries [0, i) were filled; submit exactly i */
+			break;
+		}
+		fq_addr = xsk_ring_prod__fill_addr(fq, idx++);
+		*fq_addr = (uint64_t)addr;
+	}
+
+	xsk_ring_prod__submit(fq, i);
+
+	return 0;
+}
+
+static uint16_t
+eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct pkt_rx_queue *rxq = queue;
+	struct xsk_ring_cons *rx = &rxq->rx;
+	struct xsk_umem_info *umem = rxq->umem;
+	struct xsk_ring_prod *fq = &umem->fq;
+	uint32_t idx_rx;
+	uint32_t free_thresh = fq->size >> 1;
+	struct rte_mbuf *mbufs[ETH_AF_XDP_RX_BATCH_SIZE];
+	unsigned long dropped = 0;
+	unsigned long rx_bytes = 0;
+	uint16_t count = 0;
+	int rcvd, i;
+
+	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_RX_BATCH_SIZE);
+
+	rcvd = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);
+	if (rcvd == 0)
+		return 0;
+
+	if (xsk_prod_nb_free(fq, free_thresh) >= free_thresh)
+		(void)reserve_fill_queue(umem, ETH_AF_XDP_RX_BATCH_SIZE);
+
+	if (unlikely(rte_pktmbuf_alloc_bulk(rxq->mb_pool, mbufs, rcvd) != 0))
+		return 0;
+
+	for (i = 0; i < rcvd; i++) {
+		const struct xdp_desc *desc;
+		uint64_t addr;
+		uint32_t len;
+		void *pkt;
+
+		desc = xsk_ring_cons__rx_desc(rx, idx_rx++);
+		addr = desc->addr;
+		len = desc->len;
+		pkt = xsk_umem__get_data(rxq->umem->mz->addr, addr);
+
+		rte_memcpy(rte_pktmbuf_mtod(mbufs[i], void *), pkt, len);
+		rte_pktmbuf_pkt_len(mbufs[i]) = len;
+		rte_pktmbuf_data_len(mbufs[i]) = len;
+		rx_bytes += len;
+		bufs[count++] = mbufs[i];
+
+		rte_ring_enqueue(umem->buf_ring, (void *)addr);
+	}
+
+	xsk_ring_cons__release(rx, rcvd);
+
+	/* statistics */
+	rxq->stats.rx_pkts += (rcvd - dropped);
+	rxq->stats.rx_bytes += rx_bytes;
+
+	return count;
+}
+
+static void
+pull_umem_cq(struct xsk_umem_info *umem, int size)
+{
+	struct xsk_ring_cons *cq = &umem->cq;
+	size_t i, n;
+	uint32_t idx_cq;
+
+	n = xsk_ring_cons__peek(cq, size, &idx_cq);
+
+	for (i = 0; i < n; i++) {
+		uint64_t addr;
+		addr = *xsk_ring_cons__comp_addr(cq, idx_cq++);
+		rte_ring_enqueue(umem->buf_ring, (void *)addr);
+	}
+
+	xsk_ring_cons__release(cq, n);
+}
+
+static void
+kick_tx(struct pkt_tx_queue *txq)
+{
+	struct xsk_umem_info *umem = txq->pair->umem;
+
+	while (send(xsk_socket__fd(txq->pair->xsk), NULL,
+		    0, MSG_DONTWAIT) < 0) {
+		/* something unexpected */
+		if (errno != EBUSY && errno != EAGAIN && errno != EINTR)
+			break;
+
+		/* pull from the completion queue to leave more space */
+		if (errno == EAGAIN)
+			pull_umem_cq(umem, ETH_AF_XDP_TX_BATCH_SIZE);
+	}
+	pull_umem_cq(umem, ETH_AF_XDP_TX_BATCH_SIZE);
+}
+
+static uint16_t
+eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct pkt_tx_queue *txq = queue;
+	struct xsk_umem_info *umem = txq->pair->umem;
+	struct rte_mbuf *mbuf;
+	void *addrs[ETH_AF_XDP_TX_BATCH_SIZE];
+	unsigned long tx_bytes = 0;
+	int i, valid = 0;
+	uint32_t idx_tx;
+
+	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_TX_BATCH_SIZE);
+
+	pull_umem_cq(umem, nb_pkts);
+
+	nb_pkts = rte_ring_dequeue_bulk(umem->buf_ring, addrs,
+					nb_pkts, NULL);
+	if (nb_pkts == 0)
+		return 0;
+
+	if (xsk_ring_prod__reserve(&txq->tx, nb_pkts, &idx_tx) != nb_pkts) {
+		kick_tx(txq);
+		return 0;
+	}
+
+	for (i = 0; i < nb_pkts; i++) {
+		struct xdp_desc *desc;
+		void *pkt;
+		uint32_t buf_len = ETH_AF_XDP_FRAME_SIZE
+					- ETH_AF_XDP_DATA_HEADROOM;
+		desc = xsk_ring_prod__tx_desc(&txq->tx, idx_tx + i);
+		mbuf = bufs[i];
+		if (mbuf->pkt_len <= buf_len) {
+			desc->addr = (uint64_t)addrs[valid];
+			desc->len = mbuf->pkt_len;
+			pkt = xsk_umem__get_data(umem->mz->addr,
+						 desc->addr);
+			rte_memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
+				   desc->len);
+			valid++;
+			tx_bytes += mbuf->pkt_len;
+		}
+		rte_pktmbuf_free(mbuf);
+	}
+
+	xsk_ring_prod__submit(&txq->tx, nb_pkts);
+
+	kick_tx(txq);
+
+	if (valid < nb_pkts)
+		rte_ring_enqueue_bulk(umem->buf_ring, &addrs[valid],
+				      nb_pkts - valid, NULL);
+
+	txq->stats.err_pkts += nb_pkts - valid;
+	txq->stats.tx_pkts += valid;
+	txq->stats.tx_bytes += tx_bytes;
+
+	return nb_pkts;
+}
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	dev->data->dev_link.link_status = ETH_LINK_UP;
+
+	return 0;
+}
+
+/* This function gets called when the current port gets stopped. */
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	dev->data->dev_link.link_status = ETH_LINK_DOWN;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev)
+{
+	/* rx/tx must be paired */
+	if (dev->data->nb_rx_queues != dev->data->nb_tx_queues)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev_info->if_index = internals->if_index;
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = ETH_FRAME_LEN;
+	dev_info->max_rx_queues = 1;
+	dev_info->max_tx_queues = 1;
+
+	dev_info->default_rxportconf.nb_queues = 1;
+	dev_info->default_txportconf.nb_queues = 1;
+	dev_info->default_rxportconf.ring_size = ETH_AF_XDP_DFLT_NUM_DESCS;
+	dev_info->default_txportconf.ring_size = ETH_AF_XDP_DFLT_NUM_DESCS;
+}
+
+static int
+eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct xdp_statistics xdp_stats;
+	struct pkt_rx_queue *rxq;
+	socklen_t optlen;
+	int i, ret;
+
+	for (i = 0; i < dev->data->nb_rx_queues; i++) {
+		optlen = sizeof(struct xdp_statistics);
+		rxq = &internals->rx_queues[i];
+		stats->q_ipackets[i] = internals->rx_queues[i].stats.rx_pkts;
+		stats->q_ibytes[i] = internals->rx_queues[i].stats.rx_bytes;
+
+		stats->q_opackets[i] = internals->tx_queues[i].stats.tx_pkts;
+		stats->q_obytes[i] = internals->tx_queues[i].stats.tx_bytes;
+
+		stats->ipackets += stats->q_ipackets[i];
+		stats->ibytes += stats->q_ibytes[i];
+		stats->imissed += internals->rx_queues[i].stats.rx_dropped;
+		ret = getsockopt(xsk_socket__fd(rxq->xsk), SOL_XDP,
+				 XDP_STATISTICS, &xdp_stats, &optlen);
+		if (ret != 0) {
+			AF_XDP_LOG(ERR, "getsockopt() failed for XDP_STATISTICS.\n");
+			return -1;
+		}
+		stats->imissed += xdp_stats.rx_dropped;
+
+		stats->opackets += stats->q_opackets[i];
+		stats->oerrors += internals->tx_queues[i].stats.err_pkts;
+		stats->obytes += stats->q_obytes[i];
+	}
+
+	return 0;
+}
+
+static void
+eth_stats_reset(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	int i;
+
+	for (i = 0; i < ETH_AF_XDP_MAX_QUEUE_PAIRS; i++) {
+		memset(&internals->rx_queues[i].stats, 0,
+					sizeof(struct rx_stats));
+		memset(&internals->tx_queues[i].stats, 0,
+					sizeof(struct tx_stats));
+	}
+}
+
+static void
+remove_xdp_program(struct pmd_internals *internals)
+{
+	uint32_t curr_prog_id = 0;
+
+	if (bpf_get_link_xdp_id(internals->if_index, &curr_prog_id,
+				XDP_FLAGS_UPDATE_IF_NOEXIST)) {
+		AF_XDP_LOG(ERR, "bpf_get_link_xdp_id failed\n");
+		return;
+	}
+	bpf_set_link_xdp_fd(internals->if_index, -1,
+			XDP_FLAGS_UPDATE_IF_NOEXIST);
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct pkt_rx_queue *rxq;
+	int i;
+
+	AF_XDP_LOG(INFO, "Closing AF_XDP ethdev on numa socket %u\n",
+		rte_socket_id());
+
+	for (i = 0; i < ETH_AF_XDP_MAX_QUEUE_PAIRS; i++) {
+		rxq = &internals->rx_queues[i];
+		if (rxq->umem == NULL)
+			break;
+		xsk_socket__delete(rxq->xsk);
+	}
+
+	(void)xsk_umem__delete(internals->umem->umem);
+	remove_xdp_program(internals);
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+		int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static void
+xdp_umem_destroy(struct xsk_umem_info *umem)
+{
+	rte_memzone_free(umem->mz);
+	umem->mz = NULL;
+
+	rte_ring_free(umem->buf_ring);
+	umem->buf_ring = NULL;
+
+	rte_free(umem);
+	umem = NULL;
+}
+
+static struct
+xsk_umem_info *xdp_umem_configure(void)
+{
+	struct xsk_umem_info *umem;
+	const struct rte_memzone *mz;
+	struct xsk_umem_config usr_config = {
+		.fill_size = ETH_AF_XDP_DFLT_NUM_DESCS,
+		.comp_size = ETH_AF_XDP_DFLT_NUM_DESCS,
+		.frame_size = ETH_AF_XDP_FRAME_SIZE,
+		.frame_headroom = ETH_AF_XDP_DATA_HEADROOM };
+	int ret;
+	uint64_t i;
+
+	umem = rte_zmalloc_socket("umem", sizeof(*umem), 0, rte_socket_id());
+	if (umem == NULL) {
+		AF_XDP_LOG(ERR, "Failed to allocate umem info\n");
+		return NULL;
+	}
+
+	umem->buf_ring = rte_ring_create("af_xdp_ring",
+					 ETH_AF_XDP_NUM_BUFFERS,
+					 rte_socket_id(),
+					 0x0);
+	if (umem->buf_ring == NULL) {
+		AF_XDP_LOG(ERR, "Failed to create rte_ring\n");
+		goto err;
+	}
+
+	for (i = 0; i < ETH_AF_XDP_NUM_BUFFERS; i++)
+		rte_ring_enqueue(umem->buf_ring,
+				 (void *)(i * ETH_AF_XDP_FRAME_SIZE +
+					  ETH_AF_XDP_DATA_HEADROOM));
+
+	mz = rte_memzone_reserve_aligned("af_xdp umem",
+			ETH_AF_XDP_NUM_BUFFERS * ETH_AF_XDP_FRAME_SIZE,
+			rte_socket_id(), RTE_MEMZONE_IOVA_CONTIG,
+			getpagesize());
+	if (mz == NULL) {
+		AF_XDP_LOG(ERR, "Failed to reserve memzone for af_xdp umem.\n");
+		goto err;
+	}
+
+	umem->mz = mz;
+	ret = xsk_umem__create(&umem->umem, mz->addr,
+			       ETH_AF_XDP_NUM_BUFFERS * ETH_AF_XDP_FRAME_SIZE,
+			       &umem->fq, &umem->cq,
+			       &usr_config);
+
+	if (ret) {
+		AF_XDP_LOG(ERR, "Failed to create umem\n");
+		goto err;
+	}
+
+	return umem;
+
+err:
+	xdp_umem_destroy(umem);
+	return NULL;
+}
+
+static int
+xsk_configure(struct pmd_internals *internals, struct pkt_rx_queue *rxq,
+	      int ring_size)
+{
+	struct xsk_socket_config cfg;
+	struct pkt_tx_queue *txq = rxq->pair;
+	int ret = 0;
+	int reserve_size;
+
+	rxq->umem = xdp_umem_configure();
+	if (rxq->umem == NULL)
+		return -ENOMEM;
+
+	cfg.rx_size = ring_size;
+	cfg.tx_size = ring_size;
+	cfg.libbpf_flags = 0;
+	cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+	cfg.bind_flags = 0;
+	ret = xsk_socket__create(&rxq->xsk, internals->if_name,
+			internals->queue_idx,
+			rxq->umem->umem, &rxq->rx,
+			&txq->tx, &cfg);
+	if (ret) {
+		AF_XDP_LOG(ERR, "Failed to create xsk socket.\n");
+		goto err;
+	}
+
+	reserve_size = ETH_AF_XDP_DFLT_NUM_DESCS / 2;
+	ret = reserve_fill_queue(rxq->umem, reserve_size);
+	if (ret) {
+		xsk_socket__delete(rxq->xsk);
+		AF_XDP_LOG(ERR, "Failed to reserve fill queue.\n");
+		goto err;
+	}
+
+	return 0;
+
+err:
+	xdp_umem_destroy(rxq->umem);
+
+	return ret;
+}
+
+static void
+queue_reset(struct pmd_internals *internals, uint16_t queue_idx)
+{
+	struct pkt_rx_queue *rxq = &internals->rx_queues[queue_idx];
+	struct pkt_tx_queue *txq = rxq->pair;
+
+	memset(rxq, 0, sizeof(*rxq));
+	memset(txq, 0, sizeof(*txq));
+	rxq->pair = txq;
+	txq->pair = rxq;
+	rxq->queue_idx = queue_idx;
+	txq->queue_idx = queue_idx;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev,
+		   uint16_t rx_queue_id,
+		   uint16_t nb_rx_desc,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_rxconf *rx_conf __rte_unused,
+		   struct rte_mempool *mb_pool)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	uint32_t buf_size, data_size;
+	struct pkt_rx_queue *rxq;
+	int ret;
+
+	rxq = &internals->rx_queues[rx_queue_id];
+	queue_reset(internals, rx_queue_id);
+
+	/* Now get the space available for data in the mbuf */
+	buf_size = rte_pktmbuf_data_room_size(mb_pool) -
+		RTE_PKTMBUF_HEADROOM;
+	data_size = ETH_AF_XDP_FRAME_SIZE - ETH_AF_XDP_DATA_HEADROOM;
+
+	if (data_size > buf_size) {
+		AF_XDP_LOG(ERR, "%s: %u bytes will not fit in mbuf (%u bytes)\n",
+			dev->device->name, data_size, buf_size);
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	rxq->mb_pool = mb_pool;
+
+	if (xsk_configure(internals, rxq, nb_rx_desc)) {
+		AF_XDP_LOG(ERR, "Failed to configure xdp socket\n");
+		ret = -EINVAL;
+		goto err;
+	}
+
+	internals->umem = rxq->umem;
+
+	dev->data->rx_queues[rx_queue_id] = rxq;
+	return 0;
+
+err:
+	queue_reset(internals, rx_queue_id);
+	return ret;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev,
+		   uint16_t tx_queue_id,
+		   uint16_t nb_tx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct pkt_tx_queue *txq;
+
+	txq = &internals->tx_queues[tx_queue_id];
+
+	dev->data->tx_queues[tx_queue_id] = txq;
+	return 0;
+}
+
+static int
+eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct ifreq ifr = { .ifr_mtu = mtu };
+	int ret;
+	int s;
+
+	s = socket(PF_INET, SOCK_DGRAM, 0);
+	if (s < 0)
+		return -EINVAL;
+
+	strlcpy(ifr.ifr_name, internals->if_name, IFNAMSIZ);
+	ret = ioctl(s, SIOCSIFMTU, &ifr);
+	close(s);
+
+	return (ret < 0) ? -errno : 0;
+}
+
+static void
+eth_dev_change_flags(char *if_name, uint32_t flags, uint32_t mask)
+{
+	struct ifreq ifr;
+	int s;
+
+	s = socket(PF_INET, SOCK_DGRAM, 0);
+	if (s < 0)
+		return;
+
+	strlcpy(ifr.ifr_name, if_name, IFNAMSIZ);
+	if (ioctl(s, SIOCGIFFLAGS, &ifr) < 0)
+		goto out;
+	ifr.ifr_flags &= mask;
+	ifr.ifr_flags |= flags;
+	if (ioctl(s, SIOCSIFFLAGS, &ifr) < 0)
+		goto out;
+out:
+	close(s);
+}
+
+static void
+eth_dev_promiscuous_enable(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	eth_dev_change_flags(internals->if_name, IFF_PROMISC, ~0);
+}
+
+static void
+eth_dev_promiscuous_disable(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	eth_dev_change_flags(internals->if_name, 0, ~IFF_PROMISC);
+}
+
+static const struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.mtu_set = eth_dev_mtu_set,
+	.promiscuous_enable = eth_dev_promiscuous_enable,
+	.promiscuous_disable = eth_dev_promiscuous_disable,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+	.stats_get = eth_stats_get,
+	.stats_reset = eth_stats_reset,
+};
+
+/** parse integer from integer argument */
+static int
+parse_integer_arg(const char *key __rte_unused,
+		  const char *value, void *extra_args)
+{
+	int *i = (int *)extra_args;
+	char *end;
+
+	*i = strtol(value, &end, 10);
+	if (*i < 0) {
+		AF_XDP_LOG(ERR, "Argument has to be non-negative.\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/** parse name argument */
+static int
+parse_name_arg(const char *key __rte_unused,
+	       const char *value, void *extra_args)
+{
+	char *name = extra_args;
+
+	if (strnlen(value, IFNAMSIZ) > IFNAMSIZ - 1) {
+		AF_XDP_LOG(ERR, "Invalid name %s, should be less than %u bytes.\n",
+			   value, IFNAMSIZ);
+		return -EINVAL;
+	}
+
+	strlcpy(name, value, IFNAMSIZ);
+
+	return 0;
+}
+
+static int
+parse_parameters(struct rte_kvargs *kvlist,
+		 char *if_name,
+		 int *queue_idx)
+{
+	int ret;
+
+	ret = rte_kvargs_process(kvlist, ETH_AF_XDP_IFACE_ARG,
+				 &parse_name_arg, if_name);
+	if (ret < 0)
+		goto free_kvlist;
+
+	ret = rte_kvargs_process(kvlist, ETH_AF_XDP_QUEUE_IDX_ARG,
+				 &parse_integer_arg, queue_idx);
+	if (ret < 0)
+		goto free_kvlist;
+
+free_kvlist:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
+static int
+get_iface_info(const char *if_name,
+	       struct ether_addr *eth_addr,
+	       int *if_index)
+{
+	struct ifreq ifr;
+	int sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
+
+	if (sock < 0)
+		return -1;
+
+	strlcpy(ifr.ifr_name, if_name, IFNAMSIZ);
+	if (ioctl(sock, SIOCGIFINDEX, &ifr))
+		goto error;
+
+	*if_index = ifr.ifr_ifindex;
+
+	if (ioctl(sock, SIOCGIFHWADDR, &ifr))
+		goto error;
+
+	rte_memcpy(eth_addr, ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
+
+	close(sock);
+	return 0;
+
+error:
+	close(sock);
+	return -1;
+}
+
+static struct rte_eth_dev *
+init_internals(struct rte_vdev_device *dev,
+	       const char *if_name,
+	       int queue_idx)
+{
+	const char *name = rte_vdev_device_name(dev);
+	const unsigned int numa_node = dev->device.numa_node;
+	struct pmd_internals *internals;
+	struct rte_eth_dev *eth_dev;
+	int ret;
+	int i;
+
+	internals = rte_zmalloc_socket(name, sizeof(*internals), 0, numa_node);
+	if (internals == NULL)
+		return NULL;
+
+	internals->queue_idx = queue_idx;
+	strlcpy(internals->if_name, if_name, IFNAMSIZ);
+
+	for (i = 0; i < ETH_AF_XDP_MAX_QUEUE_PAIRS; i++) {
+		internals->tx_queues[i].pair = &internals->rx_queues[i];
+		internals->rx_queues[i].pair = &internals->tx_queues[i];
+	}
+
+	ret = get_iface_info(if_name, &internals->eth_addr,
+			     &internals->if_index);
+	if (ret)
+		goto err;
+
+	eth_dev = rte_eth_vdev_allocate(dev, 0);
+	if (eth_dev == NULL)
+		goto err;
+
+	eth_dev->data->dev_private = internals;
+	eth_dev->data->dev_link = pmd_link;
+	eth_dev->data->mac_addrs = &internals->eth_addr;
+	eth_dev->dev_ops = &ops;
+	eth_dev->rx_pkt_burst = eth_af_xdp_rx;
+	eth_dev->tx_pkt_burst = eth_af_xdp_tx;
+
+	return eth_dev;
+
+err:
+	rte_free(internals);
+	return NULL;
+}
+
+static int
+rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
+{
+	struct rte_kvargs *kvlist;
+	char if_name[IFNAMSIZ] = {'\0'};
+	int xsk_queue_idx = ETH_AF_XDP_DFLT_QUEUE_IDX;
+	struct rte_eth_dev *eth_dev = NULL;
+	const char *name;
+
+	AF_XDP_LOG(INFO, "Initializing pmd_af_xdp for %s\n",
+		rte_vdev_device_name(dev));
+
+	name = rte_vdev_device_name(dev);
+	if (rte_eal_process_type() == RTE_PROC_SECONDARY &&
+		strlen(rte_vdev_device_args(dev)) == 0) {
+		eth_dev = rte_eth_dev_attach_secondary(name);
+		if (eth_dev == NULL) {
+			AF_XDP_LOG(ERR, "Failed to probe %s\n", name);
+			return -EINVAL;
+		}
+		eth_dev->dev_ops = &ops;
+		rte_eth_dev_probing_finish(eth_dev);
+		return 0;
+	}
+
+	kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments);
+	if (kvlist == NULL) {
+		AF_XDP_LOG(ERR, "Invalid kvargs key\n");
+		return -EINVAL;
+	}
+
+	if (dev->device.numa_node == SOCKET_ID_ANY)
+		dev->device.numa_node = rte_socket_id();
+
+	if (parse_parameters(kvlist, if_name, &xsk_queue_idx) < 0) {
+		AF_XDP_LOG(ERR, "Invalid kvargs value\n");
+		return -EINVAL;
+	}
+
+	if (strlen(if_name) == 0) {
+		AF_XDP_LOG(ERR, "Network interface must be specified\n");
+		return -EINVAL;
+	}
+
+	eth_dev = init_internals(dev, if_name, xsk_queue_idx);
+	if (eth_dev == NULL) {
+		AF_XDP_LOG(ERR, "Failed to init internals\n");
+		return -1;
+	}
+
+	rte_eth_dev_probing_finish(eth_dev);
+
+	return 0;
+}
+
+static int
+rte_pmd_af_xdp_remove(struct rte_vdev_device *dev)
+{
+	struct rte_eth_dev *eth_dev = NULL;
+	struct pmd_internals *internals;
+
+	AF_XDP_LOG(INFO, "Removing AF_XDP ethdev on numa socket %u\n",
+		rte_socket_id());
+
+	if (dev == NULL)
+		return -1;
+
+	/* find the ethdev entry */
+	eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
+	if (eth_dev == NULL)
+		return -1;
+
+	internals = eth_dev->data->dev_private;
+
+	rte_ring_free(internals->umem->buf_ring);
+	rte_memzone_free(internals->umem->mz);
+	rte_free(internals->umem);
+
+	rte_eth_dev_release_port(eth_dev);
+
+
+	return 0;
+}
+
+static struct rte_vdev_driver pmd_af_xdp_drv = {
+	.probe = rte_pmd_af_xdp_probe,
+	.remove = rte_pmd_af_xdp_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_af_xdp, pmd_af_xdp_drv);
+RTE_PMD_REGISTER_PARAM_STRING(net_af_xdp,
+			      "iface= "
+			      "queue= ");
+
+RTE_INIT(af_xdp_init_log)
+{
+	af_xdp_logtype = rte_log_register("pmd.net.af_xdp");
+	if (af_xdp_logtype >= 0)
+		rte_log_set_level(af_xdp_logtype, RTE_LOG_NOTICE);
+}
diff --git a/drivers/net/af_xdp/rte_pmd_af_xdp_version.map b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map
new file mode 100644
index 000000000..c6db030fe
--- /dev/null
+++ b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map
@@ -0,0 +1,3 @@
+DPDK_19.05 {
+	local: *;
+};
diff --git a/drivers/net/meson.build b/drivers/net/meson.build
index 3ecc78cee..1105e72d8 100644
--- a/drivers/net/meson.build
+++ b/drivers/net/meson.build
@@ -2,6 +2,7 @@
 # Copyright(c) 2017 Intel Corporation
 
 drivers = ['af_packet',
+	'af_xdp',
 	'ark',
 	'atlantic',
 	'avp',
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 262132fc6..f916bc9ef 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -143,6 +143,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += -lrte_mempool_dpaa2
 endif
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET) += -lrte_pmd_af_packet
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += -lrte_pmd_af_xdp -lbpf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ARK_PMD) += -lrte_pmd_ark
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ATLANTIC_PMD) += -lrte_pmd_atlantic
 _LDLIBS-$(CONFIG_RTE_LIBRTE_AVP_PMD) += -lrte_pmd_avp
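
The umem bookkeeping in rte_eth_af_xdp.c (a free list seeded with `i * ETH_AF_XDP_FRAME_SIZE + ETH_AF_XDP_DATA_HEADROOM`, from which `reserve_fill_queue()` hands offsets to the fill queue) can be modelled in a few lines of standalone C. This is a minimal sketch under stated assumptions, not driver code: it assumes a 4096-byte frame (the usual value of `XSK_UMEM__DEFAULT_FRAME_SIZE`), zero headroom and only 8 frames, and it replaces both the `rte_ring` free list and the kernel fill ring with a hypothetical array-backed FIFO (`addr_ring`). All helper names are illustrative. The point it demonstrates is the refill count: when the free list runs dry after `i` successful dequeues, exactly `i` fill-queue entries hold valid offsets, so `i` (not `i - 1`) is the number to submit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only (not taken from libbpf headers). */
#define FRAME_SIZE 4096u              /* typical XSK_UMEM__DEFAULT_FRAME_SIZE */
#define HEADROOM   0u                 /* matches ETH_AF_XDP_DATA_HEADROOM */
#define NUM_FRAMES 8u                 /* the driver uses 4096; kept small here */
#define RING_SLOTS (NUM_FRAMES + 1u)  /* one spare slot tells full from empty */

/* Hypothetical FIFO of umem offsets, standing in for rte_ring and the fill ring. */
struct addr_ring {
	uint64_t slot[RING_SLOTS];
	unsigned int head; /* next slot to dequeue */
	unsigned int tail; /* next slot to enqueue */
};

unsigned int addr_ring_count(const struct addr_ring *r)
{
	return (r->tail + RING_SLOTS - r->head) % RING_SLOTS;
}

int addr_ring_enqueue(struct addr_ring *r, uint64_t addr)
{
	if (addr_ring_count(r) == NUM_FRAMES)
		return -1; /* full */
	r->slot[r->tail] = addr;
	r->tail = (r->tail + 1) % RING_SLOTS;
	return 0;
}

int addr_ring_dequeue(struct addr_ring *r, uint64_t *addr)
{
	if (addr_ring_count(r) == 0)
		return -1; /* empty */
	*addr = r->slot[r->head];
	r->head = (r->head + 1) % RING_SLOTS;
	return 0;
}

/* Seed the free list with one offset per frame, as xdp_umem_configure() does. */
void umem_seed_free_list(struct addr_ring *free_list)
{
	for (uint64_t i = 0; i < NUM_FRAMES; i++) {
		int rc = addr_ring_enqueue(free_list, i * FRAME_SIZE + HEADROOM);

		assert(rc == 0); /* the ring is sized to hold every frame */
		(void)rc;
	}
}

/*
 * Move up to 'want' offsets from the free list to the fill queue and return
 * how many were moved. If the free list empties after i dequeues, entries
 * [0, i) are valid, so i is the count to submit.
 */
unsigned int umem_refill(struct addr_ring *free_list, struct addr_ring *fq,
			 unsigned int want)
{
	unsigned int i;
	uint64_t addr;

	for (i = 0; i < want; i++) {
		if (addr_ring_dequeue(free_list, &addr))
			break;
		/* cannot fail: only NUM_FRAMES offsets exist across both rings */
		addr_ring_enqueue(fq, addr);
	}
	return i;
}
```

For example, seeding the free list and then requesting two refills of 5 moves 5 offsets and then only the 3 that remain; the third offset placed on the fill queue is 2 * 4096 = 8192.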