[dpdk-dev,v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices

Message ID 1405362290-6753-1-git-send-email-linville@tuxdriver.com (mailing list archive)
State Superseded, archived

Commit Message

John W. Linville July 14, 2014, 6:24 p.m. UTC
  This is a Linux-specific virtual PMD driver backed by an AF_PACKET
socket.  This implementation uses mmap'ed ring buffers to limit copying
and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
AF_PACKET is used for frame reception.  In the current implementation,
Tx and Rx queues are always paired, and therefore are always equal
in number -- changing this would be a Simple Matter Of Programming.
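
As an illustration of the mmap'ed ring layout described above: the patch maps the Rx ring and the Tx ring of a queue pair with a single mmap() call, Rx first, and addresses frame i at i * framesize from the start of its ring. The sketch below only reproduces that arithmetic. The ring_layout struct and the helper names are assumptions made for this example and do not appear in the patch, and it assumes the block size is a multiple of the frame size so that frames pack contiguously.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative container for the ring geometry (not a type from the patch). */
struct ring_layout {
	size_t block_size;   /* bytes per block */
	size_t block_count;  /* blocks per ring */
	size_t frame_size;   /* bytes per frame */
};

/* Bytes mapped for one queue pair: one Rx ring followed by one Tx ring. */
size_t
ring_map_bytes(const struct ring_layout *l)
{
	assert(l != NULL);
	return 2 * l->block_size * l->block_count;
}

/* Offset of Rx frame i from the start of the mapping. */
size_t
rx_frame_offset(const struct ring_layout *l, size_t i)
{
	assert(l != NULL);
	return i * l->frame_size;
}

/* Offset of Tx frame i: the Tx ring starts right after the Rx ring. */
size_t
tx_frame_offset(const struct ring_layout *l, size_t i)
{
	assert(l != NULL);
	return l->block_size * l->block_count + i * l->frame_size;
}
```

With a 4096-byte block, 256 blocks and 2048-byte frames, each queue pair maps 2 MiB and the Tx ring begins 1 MiB into the mapping.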

Interfaces of this type are created with a command line option like
"--vdev=eth_packet0,iface=...".  There are a number of options available
as arguments:

 - Interface is chosen by "iface" (required)
 - Number of queue pairs set by "qpairs" (optional, default: 1)
 - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
 - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
 - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
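
The three MMAP options above are not independent: the patch derives the block count from them as framecnt / (blocksz / framesz) and rejects combinations that yield zero blocks. A minimal sketch of that check follows, assuming a hypothetical helper name (ring_block_count); the patch performs the same steps inline.

```c
#include <assert.h>

/* Defaults listed in the commit message. */
#define DFLT_BLOCK_SIZE  4096u
#define DFLT_FRAME_SIZE  2048u
#define DFLT_FRAME_COUNT 512u

/*
 * Derive the number of ring blocks from the three tunables.
 * Returns 0 when the combination cannot describe a valid ring.
 * (Hypothetical helper; not a function from the patch.)
 */
unsigned int
ring_block_count(unsigned int blocksz, unsigned int framesz,
		 unsigned int framecnt)
{
	unsigned int frames_per_block;

	if (blocksz == 0 || framesz == 0 || framecnt == 0)
		return 0;
	if (framesz > blocksz)
		return 0;	/* a frame must fit inside one block */

	frames_per_block = blocksz / framesz;
	return framecnt / frames_per_block;
}
```

With the defaults, two frames fit in each block, so 512 frames need 256 blocks. A frame count smaller than the number of frames per block rounds down to zero blocks and is treated as invalid.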

Signed-off-by: John W. Linville <linville@tuxdriver.com>
---
This PMD is intended to provide a means for using DPDK on a broad
range of hardware without hardware-specific PMDs and (hopefully)
with better performance than what PCAP offers in Linux.  This might
be useful as a development platform for DPDK applications when
DPDK-supported hardware is expensive or unavailable.

New in v2:

-- fixup some style issues found by checkpatch
-- use if_index as part of fanout group ID
-- set default number of queue pairs to 1
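
To illustrate the fanout group ID change listed above: the PACKET_FANOUT option takes a 32-bit value with the group ID in the low 16 bits and the fanout type and flags in the high 16 bits. The patch builds the ID from the process ID XORed with the interface index. The sketch below is an approximation under stated assumptions: the FANOUT_* macros are local stand-ins that repeat the values of PACKET_FANOUT_HASH, PACKET_FANOUT_FLAG_ROLLOVER and PACKET_FANOUT_FLAG_DEFRAG from <linux/if_packet.h> so the example builds without kernel headers, and fanout_arg() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the <linux/if_packet.h> constants. */
#define FANOUT_HASH          0x0000u
#define FANOUT_FLAG_ROLLOVER 0x1000u
#define FANOUT_FLAG_DEFRAG   0x8000u

/*
 * Build the PACKET_FANOUT argument: group ID in the low 16 bits,
 * type and flags in the high 16 bits. Unsigned arithmetic is used
 * so the shift never reaches a sign bit.
 */
uint32_t
fanout_arg(uint32_t pid, uint32_t if_index)
{
	uint32_t group_id = (pid ^ if_index) & 0xffffu;
	uint32_t type_flags = FANOUT_HASH | FANOUT_FLAG_DEFRAG |
			      FANOUT_FLAG_ROLLOVER;

	return group_id | (type_flags << 16);
}
```

Mixing the interface index into the ID means that two ports opened by the same process on different interfaces end up in different fanout groups.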

 config/common_bsdapp                   |   5 +
 config/common_linuxapp                 |   5 +
 lib/Makefile                           |   1 +
 lib/librte_eal/linuxapp/eal/Makefile   |   1 +
 lib/librte_pmd_packet/Makefile         |  60 +++
 lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
 lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
 mk/rte.app.mk                          |   4 +
 8 files changed, 957 insertions(+)
 create mode 100644 lib/librte_pmd_packet/Makefile
 create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
 create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
  

Comments

Zhou, Danny July 15, 2014, 12:15 a.m. UTC | #1
According to my performance measurements with 64-byte packets, one queue performs better than 16 queues (1.35M pps vs. 0.93M pps). That makes sense to me: in the 16-queue case, more CPU cycles are spent in kernel land (87% for 16 queues vs. 80% for 1 queue) because the NAPI-enabled ixgbe driver has to switch between polling and interrupt modes to service per-queue rx interrupts, so more context-switch overhead is involved.

Also, since the eth_packet_rx/eth_packet_tx routines involve two memory copies between the DPDK mbuf and pbuf for each packet, the PMD can hardly achieve high performance unless packets are DMA'd directly into the mbuf, which would need support from the ixgbe driver.
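
To make the per-packet copy concrete, here is a minimal, self-contained sketch of the receive-side ring walk that eth_packet_rx performs: check the frame's status word, copy the payload out, hand the frame back to the kernel, and advance. Everything named mock_* and the FRAME_* constants are stand-ins invented for illustration; the real code uses struct tpacket2_hdr, TP_STATUS_USER/TP_STATUS_KERNEL and rte_mbuf.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRAME_KERNEL 0u   /* stand-in for TP_STATUS_KERNEL */
#define FRAME_USER   1u   /* stand-in for TP_STATUS_USER */
#define FRAME_DATA   64u  /* payload bytes per mock frame */

struct mock_frame {
	uint32_t status;
	uint32_t len;
	uint8_t data[FRAME_DATA];
};

struct mock_ring {
	struct mock_frame *frames;
	unsigned int nframes;
	unsigned int next;	/* index of the next frame to inspect */
};

struct mock_pkt {
	uint32_t len;
	uint8_t data[FRAME_DATA];
};

/* Drain up to nb_pkts user-owned frames; returns how many were copied. */
unsigned int
mock_rx_burst(struct mock_ring *ring, struct mock_pkt *pkts,
	      unsigned int nb_pkts)
{
	unsigned int n;

	assert(ring != NULL && ring->nframes > 0);

	for (n = 0; n < nb_pkts; n++) {
		struct mock_frame *f = &ring->frames[ring->next];
		uint32_t len;

		/* stop at the first frame still owned by the kernel */
		if ((f->status & FRAME_USER) == 0)
			break;

		/* this is the per-packet copy discussed above */
		len = f->len < FRAME_DATA ? f->len : FRAME_DATA;
		memcpy(pkts[n].data, f->data, len);
		pkts[n].len = len;

		/* release the frame and advance, wrapping at the end */
		f->status = FRAME_KERNEL;
		if (++ring->next >= ring->nframes)
			ring->next = 0;
	}
	return n;
}
```

The memcpy in the loop is the copy that a zero-copy design would have to eliminate; the transmit path has a matching copy in the other direction.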

> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Tuesday, July 15, 2014 2:25 AM
> To: dev@dpdk.org
> Cc: Thomas Monjalon; Richardson, Bruce; Zhou, Danny
> Subject: [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual
> devices
> 
> diff --git a/config/common_bsdapp b/config/common_bsdapp
> index 943dce8f1ede..c317f031278e 100644
> --- a/config/common_bsdapp
> +++ b/config/common_bsdapp
> @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
> 
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> +
> +#
>  # Do prefetch of packet data within PMD driver receive function
>  #
>  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> diff --git a/config/common_linuxapp b/config/common_linuxapp
> index 7bf5d80d4e26..f9e7bc3015ec 100644
> --- a/config/common_linuxapp
> +++ b/config/common_linuxapp
> @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
> 
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> +
> +#
>  # Compile Xen PMD
>  #
>  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> diff --git a/lib/Makefile b/lib/Makefile
> index 10c5bb3045bc..930fadf29898 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
>  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
>  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> index 756d6b0c9301..feed24a63272 100644
> --- a/lib/librte_eal/linuxapp/eal/Makefile
> +++ b/lib/librte_eal/linuxapp/eal/Makefile
> @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
>  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
>  CFLAGS += $(WERROR_FLAGS) -O3
> 
> diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> new file mode 100644
> index 000000000000..e1266fb992cd
> --- /dev/null
> +++ b/lib/librte_pmd_packet/Makefile
> @@ -0,0 +1,60 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> +#   Copyright(c) 2014 6WIND S.A.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +#
> +# library name
> +#
> +LIB = librte_pmd_packet.a
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +
> +#
> +# all source are stored in SRCS-y
> +#
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> +
> +#
> +# Export include files
> +#
> +SYMLINK-y-include += rte_eth_packet.h
> +
> +# this lib depends upon:
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> new file mode 100644
> index 000000000000..9c82d16e730f
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> @@ -0,0 +1,826 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> + *
> + *   Originally based upon librte_pmd_pcap code:
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2014 6WIND S.A.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_mbuf.h>
> +#include <rte_ethdev.h>
> +#include <rte_malloc.h>
> +#include <rte_kvargs.h>
> +#include <rte_dev.h>
> +
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <arpa/inet.h>
> +#include <net/if.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include <poll.h>
> +
> +#include "rte_eth_packet.h"
> +
> +#define ETH_PACKET_IFACE_ARG		"iface"
> +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> +
> +#define DFLT_BLOCK_SIZE		(1 << 12)
> +#define DFLT_FRAME_SIZE		(1 << 11)
> +#define DFLT_FRAME_COUNT	(1 << 9)
> +
> +struct pkt_rx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	struct rte_mempool *mb_pool;
> +
> +	volatile unsigned long rx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pkt_tx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	volatile unsigned long tx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pmd_internals {
> +	unsigned nb_queues;
> +
> +	int if_index;
> +	struct ether_addr eth_addr;
> +
> +	struct tpacket_req req;
> +
> +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +};
> +
> +static const char *valid_arguments[] = {
> +	ETH_PACKET_IFACE_ARG,
> +	ETH_PACKET_NUM_Q_ARG,
> +	ETH_PACKET_BLOCKSIZE_ARG,
> +	ETH_PACKET_FRAMESIZE_ARG,
> +	ETH_PACKET_FRAMECOUNT_ARG,
> +	NULL
> +};
> +
> +static const char *drivername = "AF_PACKET PMD";
> +
> +static struct rte_eth_link pmd_link = {
> +	.link_speed = 10000,
> +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> +	.link_status = 0
> +};
> +
> +static uint16_t
> +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	unsigned i;
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	struct pkt_rx_queue *pkt_q = queue;
> +	uint16_t num_rx = 0;
> +	unsigned int framecount, framenum;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	/*
> +	 * Reads the given number of packets from the AF_PACKET socket one by
> +	 * one and copies the packet data into a newly allocated mbuf.
> +	 */
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> +			break;
> +
> +		/* allocate the next mbuf */
> +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> +		if (unlikely(mbuf == NULL))
> +			break;
> +
> +		/* packet will fit in the mbuf, go ahead and receive it */
> +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_KERNEL;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +
> +		/* account for the receive frame */
> +		bufs[i] = mbuf;
> +		num_rx++;
> +	}
> +	pkt_q->framenum = framenum;
> +	pkt_q->rx_pkts += num_rx;
> +	return num_rx;
> +}
> +
> +/*
> + * Callback to handle sending packets through a real NIC.
> + */
> +static uint16_t
> +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	unsigned int framecount, framenum;
> +	struct pollfd pfd;
> +	struct pkt_tx_queue *pkt_q = queue;
> +	uint16_t num_tx = 0;
> +	int i;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	memset(&pfd, 0, sizeof(pfd));
> +	pfd.fd = pkt_q->sockfd;
> +	pfd.events = POLLOUT;
> +	pfd.revents = 0;
> +
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> +		    (poll(&pfd, 1, -1) < 0))
> +				continue;
> +
> +		/* copy the tx frame data */
> +		mbuf = bufs[num_tx];
> +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> +			sizeof(struct sockaddr_ll);
> +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +
> +		num_tx++;
> +		rte_pktmbuf_free(mbuf);
> +	}
> +
> +	/* kick-off transmits */
> +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +
> +	pkt_q->framenum = framenum;
> +	pkt_q->tx_pkts += num_tx;
> +	pkt_q->err_pkts += nb_pkts - num_tx;
> +	return num_tx;
> +}
> +
> +static int
> +eth_dev_start(struct rte_eth_dev *dev)
> +{
> +	dev->data->dev_link.link_status = 1;
> +	return 0;
> +}
> +
> +/*
> + * This function gets called when the current port gets stopped.
> + */
> +static void
> +eth_dev_stop(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	int sockfd;
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	for (i = 0; i < internals->nb_queues; i++) {
> +		sockfd = internals->rx_queue[i].sockfd;
> +		if (sockfd != -1)
> +			close(sockfd);
> +		sockfd = internals->tx_queue[i].sockfd;
> +		if (sockfd != -1)
> +			close(sockfd);
> +	}
> +
> +	dev->data->dev_link.link_status = 0;
> +}
> +
> +static int
> +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> +{
> +	return 0;
> +}
> +
> +static void
> +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> +{
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev_info->driver_name = drivername;
> +	dev_info->if_index = internals->if_index;
> +	dev_info->max_mac_addrs = 1;
> +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->min_rx_bufsize = 0;
> +	dev_info->pci_dev = NULL;
> +}
> +
> +static void
> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> +{
> +	unsigned i, imax;
> +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> +	const struct pmd_internals *internal = dev->data->dev_private;
> +
> +	memset(igb_stats, 0, sizeof(*igb_stats));
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> +		rx_total += igb_stats->q_ipackets[i];
> +	}
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> +		tx_total += igb_stats->q_opackets[i];
> +		tx_err_total += igb_stats->q_errors[i];
> +	}
> +
> +	igb_stats->ipackets = rx_total;
> +	igb_stats->opackets = tx_total;
> +	igb_stats->oerrors = tx_err_total;
> +}
> +
> +static void
> +eth_stats_reset(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	struct pmd_internals *internal = dev->data->dev_private;
> +
> +	for (i = 0; i < internal->nb_queues; i++)
> +		internal->rx_queue[i].rx_pkts = 0;
> +
> +	for (i = 0; i < internal->nb_queues; i++) {
> +		internal->tx_queue[i].tx_pkts = 0;
> +		internal->tx_queue[i].err_pkts = 0;
> +	}
> +}
> +
> +static void
> +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> +{
> +}
> +
> +static void
> +eth_queue_release(void *q __rte_unused)
> +{
> +}
> +
> +static int
> +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> +                int wait_to_complete __rte_unused)
> +{
> +	return 0;
> +}
> +
> +static int
> +eth_rx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t rx_queue_id,
> +                   uint16_t nb_rx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> +                   struct rte_mempool *mb_pool)
> +{
> +	struct pmd_internals *internals = dev->data->dev_private;
> +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> +	struct rte_pktmbuf_pool_private *mbp_priv;
> +	uint16_t buf_size;
> +
> +	pkt_q->mb_pool = mb_pool;
> +
> +	/* Now get the space available for data in the mbuf */
> +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> +	                       RTE_PKTMBUF_HEADROOM);
> +
> +	if (ETH_FRAME_LEN > buf_size) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> +			dev->data->name, ETH_FRAME_LEN, buf_size);
> +		return -ENOMEM;
> +	}
> +
> +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> +
> +	return 0;
> +}
> +
> +static int
> +eth_tx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t tx_queue_id,
> +                   uint16_t nb_tx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> +{
> +
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> +	return 0;
> +}
> +
> +static struct eth_dev_ops ops = {
> +	.dev_start = eth_dev_start,
> +	.dev_stop = eth_dev_stop,
> +	.dev_close = eth_dev_close,
> +	.dev_configure = eth_dev_configure,
> +	.dev_infos_get = eth_dev_info,
> +	.rx_queue_setup = eth_rx_queue_setup,
> +	.tx_queue_setup = eth_tx_queue_setup,
> +	.rx_queue_release = eth_queue_release,
> +	.tx_queue_release = eth_queue_release,
> +	.link_update = eth_link_update,
> +	.stats_get = eth_stats_get,
> +	.stats_reset = eth_stats_reset,
> +};
> +
> +/*
> + * Opens an AF_PACKET socket
> + */
> +static int
> +open_packet_iface(const char *key __rte_unused,
> +                  const char *value __rte_unused,
> +                  void *extra_args)
> +{
> +	int *sockfd = extra_args;
> +
> +	/* Open an AF_PACKET socket... */
> +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +	if (*sockfd == -1) {
> +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +rte_pmd_init_internals(const char *name,
> +                       const int sockfd,
> +                       const unsigned nb_queues,
> +                       unsigned int blocksize,
> +                       unsigned int blockcnt,
> +                       unsigned int framesize,
> +                       unsigned int framecnt,
> +                       const unsigned numa_node,
> +                       struct pmd_internals **internals,
> +                       struct rte_eth_dev **eth_dev,
> +                       struct rte_kvargs *kvlist)
> +{
> +	struct rte_eth_dev_data *data = NULL;
> +	struct rte_pci_device *pci_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	struct ifreq ifr;
> +	size_t ifnamelen;
> +	unsigned k_idx;
> +	struct sockaddr_ll sockaddr;
> +	struct tpacket_req *req;
> +	struct pkt_rx_queue *rx_queue;
> +	struct pkt_tx_queue *tx_queue;
> +	int rc, tpver, discard, bypass;
> +	unsigned int i, q, rdsize;
> +	int qsockfd, fanout_arg;
> +
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> +			break;
> +	}
> +	if (pair == NULL) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: no interface specified for AF_PACKET ethdev\n",
> +		        name);
> +		goto error;
> +	}
> +
> +	RTE_LOG(INFO, PMD,
> +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> +		name, numa_node);
> +
> +	/*
> +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> +	 * and internal (private) data
> +	 */
> +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> +	if (data == NULL)
> +		goto error;
> +
> +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> +	if (pci_dev == NULL)
> +		goto error;
> +
> +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> +	                                0, numa_node);
> +	if (*internals == NULL)
> +		goto error;
> +
> +	req = &((*internals)->req);
> +
> +	req->tp_block_size = blocksize;
> +	req->tp_block_nr = blockcnt;
> +	req->tp_frame_size = framesize;
> +	req->tp_frame_nr = framecnt;
> +
> +	ifnamelen = strlen(pair->value);
> +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> +		ifr.ifr_name[ifnamelen] = '\0';
> +	} else {
> +		RTE_LOG(ERR, PMD,
> +			"%s: I/F name too long (%s)\n",
> +			name, pair->value);
> +		goto error;
> +	}
> +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> +		        name);
> +		goto error;
> +	}
> +	(*internals)->if_index = ifr.ifr_ifindex;
> +
> +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> +		        name);
> +		goto error;
> +	}
> +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> +
> +	memset(&sockaddr, 0, sizeof(sockaddr));
> +	sockaddr.sll_family = AF_PACKET;
> +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> +	sockaddr.sll_ifindex = (*internals)->if_index;
> +
> +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> +
> +	for (q = 0; q < nb_queues; q++) {
> +		/* Open an AF_PACKET socket for this queue... */
> +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +		if (qsockfd == -1) {
> +			RTE_LOG(ERR, PMD,
> +			        "%s: could not open AF_PACKET socket\n",
> +			        name);
> +			return -1;
> +		}
> +
> +		tpver = TPACKET_V2;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> +				&tpver, sizeof(tpver));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_VERSION on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		discard = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> +				&discard, sizeof(discard));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_LOSS on "
> +			        "AF_PACKET socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		bypass = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> +				&bypass, sizeof(bypass));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_QDISC_BYPASS "
> +			        "on AF_PACKET socket for %s\n", name,
> +			        pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rx_queue = &((*internals)->rx_queue[q]);
> +		rx_queue->framecount = req->tp_frame_nr;
> +
> +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> +				    qsockfd, 0);
> +		if (rx_queue->map == MAP_FAILED) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> +				name, pair->value);
> +			goto error;
> +		}
> +
> +		/* rdsize is same for both Tx and Rx */
> +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> +
> +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		rx_queue->sockfd = qsockfd;
> +
> +		tx_queue = &((*internals)->tx_queue[q]);
> +		tx_queue->framecount = req->tp_frame_nr;
> +
> +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> +
> +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		tx_queue->sockfd = qsockfd;
> +
> +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not bind AF_PACKET socket to %s\n",
> +			        name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> +				&fanout_arg, sizeof(fanout_arg));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> +				"for %s\n", name, pair->value);
> +			goto error;
> +		}
> +	}
> +
> +	/* reserve an ethdev entry */
> +	*eth_dev = rte_eth_dev_allocate(name);
> +	if (*eth_dev == NULL)
> +		goto error;
> +
> +	/*
> +	 * now put it all together
> +	 * - store queue data in internals,
> +	 * - store numa_node info in pci_driver
> +	 * - point eth_dev_data to internals and pci_driver
> +	 * - and point eth_dev structure to new eth_dev_data structure
> +	 */
> +
> +	(*internals)->nb_queues = nb_queues;
> +
> +	data->dev_private = *internals;
> +	data->port_id = (*eth_dev)->data->port_id;
> +	data->nb_rx_queues = (uint16_t)nb_queues;
> +	data->nb_tx_queues = (uint16_t)nb_queues;
> +	data->dev_link = pmd_link;
> +	data->mac_addrs = &(*internals)->eth_addr;
> +
> +	pci_dev->numa_node = numa_node;
> +
> +	(*eth_dev)->data = data;
> +	(*eth_dev)->dev_ops = &ops;
> +	(*eth_dev)->pci_dev = pci_dev;
> +
> +	return 0;
> +
> +error:
> +	if (data)
> +		rte_free(data);
> +	if (pci_dev)
> +		rte_free(pci_dev);
> +	for (q = 0; q < nb_queues; q++) {
> +		if ((*internals)->rx_queue[q].rd)
> +			rte_free((*internals)->rx_queue[q].rd);
> +		if ((*internals)->tx_queue[q].rd)
> +			rte_free((*internals)->tx_queue[q].rd);
> +	}
> +	if (*internals)
> +		rte_free(*internals);
> +	return -1;
> +}
> +
> +static int
> +rte_eth_from_packet(const char *name,
> +                    int const *sockfd,
> +                    const unsigned numa_node,
> +                    struct rte_kvargs *kvlist)
> +{
> +	struct pmd_internals *internals = NULL;
> +	struct rte_eth_dev *eth_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	unsigned k_idx;
> +	unsigned int blockcount;
> +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> +	unsigned int framesize = DFLT_FRAME_SIZE;
> +	unsigned int framecount = DFLT_FRAME_COUNT;
> +	unsigned int qpairs = 1;
> +
> +	/* do some parameter checking */
> +	if (*sockfd < 0)
> +		return -1;
> +
> +	/*
> +	 * Walk arguments for configurable settings
> +	 */
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> +			qpairs = atoi(pair->value);
> +			if (qpairs < 1 ||
> +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid qpairs value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> +			blocksize = atoi(pair->value);
> +			if (!blocksize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid blocksize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> +			framesize = atoi(pair->value);
> +			if (!framesize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framesize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> +			framecount = atoi(pair->value);
> +			if (!framecount) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framecount value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +	}
> +
> +	if (framesize > blocksize) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> +		        name);
> +		return -1;
> +	}
> +
> +	blockcount = framecount / (blocksize / framesize);
> +	if (!blockcount) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> +		return -1;
> +	}
> +
> +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> +
> +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> +	                           blocksize, blockcount,
> +	                           framesize, framecount,
> +	                           numa_node, &internals, &eth_dev,
> +	                           kvlist) < 0)
> +		return -1;
> +
> +	eth_dev->rx_pkt_burst = eth_packet_rx;
> +	eth_dev->tx_pkt_burst = eth_packet_tx;
> +
> +	return 0;
> +}
> +
> +int
> +rte_pmd_packet_devinit(const char *name, const char *params) {
> +	unsigned numa_node;
> +	int ret;
> +	struct rte_kvargs *kvlist;
> +	int sockfd = -1;
> +
> +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> +
> +	numa_node = rte_socket_id();
> +
> +	kvlist = rte_kvargs_parse(params, valid_arguments);
> +	if (kvlist == NULL)
> +		return -1;
> +
> +	/*
> +	 * If iface argument is passed we open the NICs and use them for
> +	 * reading / writing
> +	 */
> +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> +
> +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> +		                         &open_packet_iface, &sockfd);
> +		if (ret < 0)
> +			return -1;
> +	}
> +
> +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> +	close(sockfd); /* no longer needed */
> +
> +	if (ret < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static struct rte_driver pmd_packet_drv = {
> +	.name = "eth_packet",
> +	.type = PMD_VDEV,
> +	.init = rte_pmd_packet_devinit,
> +};
> +
> +PMD_REGISTER_DRIVER(pmd_packet_drv);
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> new file mode 100644
> index 000000000000..f685611da3e9
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> @@ -0,0 +1,55 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_ETH_PACKET_H_
> +#define _RTE_ETH_PACKET_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> +
> +#define RTE_PMD_PACKET_MAX_RINGS 16
> +
> +/**
> + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> + * configured on command line.
> + */
> +int rte_pmd_packet_devinit(const char *name, const char *params);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 34dff2a02a05..a6994c4dbe93 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
>  LDLIBS += -lrte_pmd_pcap -lpcap
>  endif
> 
> +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> +LDLIBS += -lrte_pmd_packet
> +endif
> +
>  endif # plugins
> 
>  LDLIBS += $(EXECENV_LDLIBS)
> --
> 1.9.3
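
For readers following the parameter check in rte_eth_from_packet() quoted
above: the block count is derived as framecount / (blocksize / framesize),
and the later mmap() call covers 2 * tp_block_size * tp_block_nr bytes (one
Rx ring plus one Tx ring). The sketch below restates that arithmetic in
isolation. The struct and helper names (ring_geometry, ring_geometry_init,
ring_map_bytes) are made up for illustration and do not appear in the patch.

```c
#include <stddef.h>

/* Hypothetical container for the four AF_PACKET ring parameters. */
struct ring_geometry {
	unsigned int blocksize;
	unsigned int blockcount;
	unsigned int framesize;
	unsigned int framecount;
};

/*
 * Mirrors the checks quoted from rte_eth_from_packet(): zero values are
 * rejected, a frame must fit in a block, and the derived block count
 * must be non-zero. Returns 0 on success, -1 on invalid parameters.
 */
static int
ring_geometry_init(struct ring_geometry *g, unsigned int blocksize,
		   unsigned int framesize, unsigned int framecount)
{
	unsigned int frames_per_block;
	unsigned int blockcount;

	if (blocksize == 0 || framesize == 0 || framecount == 0)
		return -1;
	if (framesize > blocksize)
		return -1;

	frames_per_block = blocksize / framesize;
	blockcount = framecount / frames_per_block;
	if (blockcount == 0)
		return -1;

	g->blocksize = blocksize;
	g->blockcount = blockcount;
	g->framesize = framesize;
	g->framecount = framecount;
	return 0;
}

/* Size of the shared mapping for one queue pair: Rx ring plus Tx ring. */
static unsigned long
ring_map_bytes(const struct ring_geometry *g)
{
	return 2UL * g->blocksize * g->blockcount;
}
```

With the patch defaults (blocksz=4096, framesz=2048, framecnt=512) this
gives 2 frames per block and 256 blocks, so each queue pair maps 2 MiB.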
  
Neil Horman July 15, 2014, 12:17 p.m. UTC | #2
On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> According to my performance measurements with 64B packets, 1 queue performs better than 16 queues (1.35M pps vs. 0.93M pps). That makes sense to me: in the 16-queue case more CPU cycles are spent in kernel land (87% for 16 queues vs. 80% for 1 queue), because the NAPI-enabled ixgbe driver has to switch between polling and interrupt modes to service per-queue rx interrupts, so more context-switch overhead is involved. Also, since the eth_packet_rx/eth_packet_tx routines involve two memory copies between the DPDK mbuf and pbuf for each packet, it can hardly achieve high performance unless packets are DMA'd directly into mbufs, which would need ixgbe driver support.

I thought 16 queues would be spread out between as many cpus as you had though,
obviating the need for context switches, no?
Neil
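
As background to the question above: the quoted patch asks the kernel to
hash flows across the per-queue sockets by building a 32-bit fanout word,
with the group ID (getpid() XOR if_index) in the low 16 bits and
PACKET_FANOUT_HASH plus the DEFRAG and ROLLOVER flags in the high 16 bits.
A minimal sketch of that packing follows. The helper names are hypothetical,
the PID is passed in as a parameter so the result is deterministic, and an
unsigned type is used instead of the patch's int to keep the shift well
defined. Which CPU ends up servicing each socket is a separate matter that
this sketch does not address.

```c
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * Pack a PACKET_FANOUT argument the way the quoted patch does:
 * low 16 bits  = fanout group ID, derived from PID and interface index;
 * high 16 bits = fanout type (hash) ORed with the defrag/rollover flags.
 */
static uint32_t
make_fanout_arg(int pid, int if_index)
{
	uint32_t arg = (uint32_t)(pid ^ if_index) & 0xffff;

	arg |= (uint32_t)(PACKET_FANOUT_HASH |
			  PACKET_FANOUT_FLAG_DEFRAG |
			  PACKET_FANOUT_FLAG_ROLLOVER) << 16;
	return arg;
}

/* Extract the group ID shared by all sockets in one fanout group. */
static uint32_t
fanout_group_id(uint32_t arg)
{
	return arg & 0xffff;
}

/* Extract the fanout type and flags from the upper half. */
static uint32_t
fanout_type_flags(uint32_t arg)
{
	return arg >> 16;
}
```

Every queue socket of one device would pass the same word, so they all join
one group. Mixing if_index into the ID (new in v2) keeps two devices created
by the same process from landing in the same group.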

> 
> > -----Original Message-----
> > From: John W. Linville [mailto:linville@tuxdriver.com]
> > Sent: Tuesday, July 15, 2014 2:25 AM
> > To: dev@dpdk.org
> > Cc: Thomas Monjalon; Richardson, Bruce; Zhou, Danny
> > Subject: [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual
> > devices
> > 
> > This is a Linux-specific virtual PMD driver backed by an AF_PACKET socket.  This
> > implementation uses mmap'ed ring buffers to limit copying and user/kernel
> > transitions.  The PACKET_FANOUT_HASH behavior of AF_PACKET is used for
> > frame reception.  In the current implementation, Tx and Rx queues are always paired,
> > and therefore are always equal in number -- changing this would be a Simple Matter
> > Of Programming.
> > 
> > Interfaces of this type are created with a command line option like
> > "--vdev=eth_packet0,iface=...".  There are a number of options available as
> > arguments:
> > 
> >  - Interface is chosen by "iface" (required)
> >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > 
> > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > ---
> > This PMD is intended to provide a means for using DPDK on a broad range of
> > hardware without hardware-specific PMDs and (hopefully) with better performance
> > than what PCAP offers in Linux.  This might be useful as a development platform for
> > DPDK applications when DPDK-supported hardware is expensive or unavailable.
> > 
> > New in v2:
> > 
> > -- fixup some style issues found by check patch
> > -- use if_index as part of fanout group ID
> > -- set default number of queue pairs to 1
> > 
> >  config/common_bsdapp                   |   5 +
> >  config/common_linuxapp                 |   5 +
> >  lib/Makefile                           |   1 +
> >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> >  lib/librte_pmd_packet/Makefile         |  60 +++
> >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> >  mk/rte.app.mk                          |   4 +
> >  8 files changed, 957 insertions(+)
> >  create mode 100644 lib/librte_pmd_packet/Makefile
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > 
> > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > index 943dce8f1ede..c317f031278e 100644
> > --- a/config/common_bsdapp
> > +++ b/config/common_bsdapp
> > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > 
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > +
> > +#
> >  # Do prefetch of packet data within PMD driver receive function
> >  #
> >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > --- a/config/common_linuxapp
> > +++ b/config/common_linuxapp
> > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > 
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > +
> > +#
> >  # Compile Xen PMD
> >  #
> >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 10c5bb3045bc..930fadf29898 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > index 756d6b0c9301..feed24a63272 100644
> > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> >  CFLAGS += $(WERROR_FLAGS) -O3
> > 
> > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > new file mode 100644
> > index 000000000000..e1266fb992cd
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/Makefile
> > @@ -0,0 +1,60 @@
> > +#   BSD LICENSE
> > +#
> > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > +#   Copyright(c) 2014 6WIND S.A.
> > +#   All rights reserved.
> > +#
> > +#   Redistribution and use in source and binary forms, with or without
> > +#   modification, are permitted provided that the following conditions
> > +#   are met:
> > +#
> > +#     * Redistributions of source code must retain the above copyright
> > +#       notice, this list of conditions and the following disclaimer.
> > +#     * Redistributions in binary form must reproduce the above copyright
> > +#       notice, this list of conditions and the following disclaimer in
> > +#       the documentation and/or other materials provided with the
> > +#       distribution.
> > +#     * Neither the name of Intel Corporation nor the names of its
> > +#       contributors may be used to endorse or promote products derived
> > +#       from this software without specific prior written permission.
> > +#
> > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +#
> > +# library name
> > +#
> > +LIB = librte_pmd_packet.a
> > +
> > +CFLAGS += -O3
> > +CFLAGS += $(WERROR_FLAGS)
> > +
> > +#
> > +# all source are stored in SRCS-y
> > +#
> > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > +
> > +#
> > +# Export include files
> > +#
> > +SYMLINK-y-include += rte_eth_packet.h
> > +
> > +# this lib depends upon:
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > new file mode 100644
> > index 000000000000..9c82d16e730f
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > @@ -0,0 +1,826 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > + *
> > + *   Originally based upon librte_pmd_pcap code:
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   Copyright(c) 2014 6WIND S.A.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#include <rte_mbuf.h>
> > +#include <rte_ethdev.h>
> > +#include <rte_malloc.h>
> > +#include <rte_kvargs.h>
> > +#include <rte_dev.h>
> > +
> > +#include <linux/if_ether.h>
> > +#include <linux/if_packet.h>
> > +#include <arpa/inet.h>
> > +#include <net/if.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +#include <poll.h>
> > +
> > +#include "rte_eth_packet.h"
> > +
> > +#define ETH_PACKET_IFACE_ARG		"iface"
> > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > +
> > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > +#define DFLT_FRAME_SIZE		(1 << 11)
> > +#define DFLT_FRAME_COUNT	(1 << 9)
> > +
> > +struct pkt_rx_queue {
> > +	int sockfd;
> > +
> > +	struct iovec *rd;
> > +	uint8_t *map;
> > +	unsigned int framecount;
> > +	unsigned int framenum;
> > +
> > +	struct rte_mempool *mb_pool;
> > +
> > +	volatile unsigned long rx_pkts;
> > +	volatile unsigned long err_pkts;
> > +};
> > +
> > +struct pkt_tx_queue {
> > +	int sockfd;
> > +
> > +	struct iovec *rd;
> > +	uint8_t *map;
> > +	unsigned int framecount;
> > +	unsigned int framenum;
> > +
> > +	volatile unsigned long tx_pkts;
> > +	volatile unsigned long err_pkts;
> > +};
> > +
> > +struct pmd_internals {
> > +	unsigned nb_queues;
> > +
> > +	int if_index;
> > +	struct ether_addr eth_addr;
> > +
> > +	struct tpacket_req req;
> > +
> > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > +};
> > +
> > +static const char *valid_arguments[] = {
> > +	ETH_PACKET_IFACE_ARG,
> > +	ETH_PACKET_NUM_Q_ARG,
> > +	ETH_PACKET_BLOCKSIZE_ARG,
> > +	ETH_PACKET_FRAMESIZE_ARG,
> > +	ETH_PACKET_FRAMECOUNT_ARG,
> > +	NULL
> > +};
> > +
> > +static const char *drivername = "AF_PACKET PMD";
> > +
> > +static struct rte_eth_link pmd_link = {
> > +	.link_speed = 10000,
> > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > +	.link_status = 0
> > +};
> > +
> > +static uint16_t
> > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> > +	unsigned i;
> > +	struct tpacket2_hdr *ppd;
> > +	struct rte_mbuf *mbuf;
> > +	uint8_t *pbuf;
> > +	struct pkt_rx_queue *pkt_q = queue;
> > +	uint16_t num_rx = 0;
> > +	unsigned int framecount, framenum;
> > +
> > +	if (unlikely(nb_pkts == 0))
> > +		return 0;
> > +
> > +	/*
> > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > +	 * one and copies the packet data into a newly allocated mbuf.
> > +	 */
> > +	framecount = pkt_q->framecount;
> > +	framenum = pkt_q->framenum;
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		/* point at the next incoming frame */
> > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > +			break;
> > +
> > +		/* allocate the next mbuf */
> > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > +		if (unlikely(mbuf == NULL))
> > +			break;
> > +
> > +		/* packet will fit in the mbuf, go ahead and receive it */
> > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > +
> > +		/* release incoming frame and advance ring buffer */
> > +		ppd->tp_status = TP_STATUS_KERNEL;
> > +		if (++framenum >= framecount)
> > +			framenum = 0;
> > +
> > +		/* account for the receive frame */
> > +		bufs[i] = mbuf;
> > +		num_rx++;
> > +	}
> > +	pkt_q->framenum = framenum;
> > +	pkt_q->rx_pkts += num_rx;
> > +	return num_rx;
> > +}
> > +
> > +/*
> > + * Callback to handle sending packets through a real NIC.
> > + */
> > +static uint16_t
> > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> > +	struct tpacket2_hdr *ppd;
> > +	struct rte_mbuf *mbuf;
> > +	uint8_t *pbuf;
> > +	unsigned int framecount, framenum;
> > +	struct pollfd pfd;
> > +	struct pkt_tx_queue *pkt_q = queue;
> > +	uint16_t num_tx = 0;
> > +	int i;
> > +
> > +	if (unlikely(nb_pkts == 0))
> > +		return 0;
> > +
> > +	memset(&pfd, 0, sizeof(pfd));
> > +	pfd.fd = pkt_q->sockfd;
> > +	pfd.events = POLLOUT;
> > +	pfd.revents = 0;
> > +
> > +	framecount = pkt_q->framecount;
> > +	framenum = pkt_q->framenum;
> > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		/* point at the next incoming frame */
> > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > +		    (poll(&pfd, 1, -1) < 0))
> > +				continue;
> > +
> > +		/* copy the tx frame data */
> > +		mbuf = bufs[num_tx];
> > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > +			sizeof(struct sockaddr_ll);
> > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > +
> > +		/* release incoming frame and advance ring buffer */
> > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > +		if (++framenum >= framecount)
> > +			framenum = 0;
> > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +
> > +		num_tx++;
> > +		rte_pktmbuf_free(mbuf);
> > +	}
> > +
> > +	/* kick-off transmits */
> > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > +
> > +	pkt_q->framenum = framenum;
> > +	pkt_q->tx_pkts += num_tx;
> > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > +	return num_tx;
> > +}
> > +
> > +static int
> > +eth_dev_start(struct rte_eth_dev *dev)
> > +{
> > +	dev->data->dev_link.link_status = 1;
> > +	return 0;
> > +}
> > +
> > +/*
> > + * This function gets called when the current port gets stopped.
> > + */
> > +static void
> > +eth_dev_stop(struct rte_eth_dev *dev)
> > +{
> > +	unsigned i;
> > +	int sockfd;
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	for (i = 0; i < internals->nb_queues; i++) {
> > +		sockfd = internals->rx_queue[i].sockfd;
> > +		if (sockfd != -1)
> > +			close(sockfd);
> > +		sockfd = internals->tx_queue[i].sockfd;
> > +		if (sockfd != -1)
> > +			close(sockfd);
> > +	}
> > +
> > +	dev->data->dev_link.link_status = 0;
> > +}
> > +
> > +static int
> > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused) {
> > +	return 0;
> > +}
> > +
> > +static void
> > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info
> > +*dev_info) {
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	dev_info->driver_name = drivername;
> > +	dev_info->if_index = internals->if_index;
> > +	dev_info->max_mac_addrs = 1;
> > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > +	dev_info->min_rx_bufsize = 0;
> > +	dev_info->pci_dev = NULL;
> > +}
> > +
> > +static void
> > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > +{
> > +	unsigned i, imax;
> > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > +	const struct pmd_internals *internal = dev->data->dev_private;
> > +
> > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > +
> > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > +	for (i = 0; i < imax; i++) {
> > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > +		rx_total += igb_stats->q_ipackets[i];
> > +	}
> > +
> > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > +	for (i = 0; i < imax; i++) {
> > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > +		tx_total += igb_stats->q_opackets[i];
> > +		tx_err_total += igb_stats->q_errors[i];
> > +	}
> > +
> > +	igb_stats->ipackets = rx_total;
> > +	igb_stats->opackets = tx_total;
> > +	igb_stats->oerrors = tx_err_total;
> > +}
> > +
> > +static void
> > +eth_stats_reset(struct rte_eth_dev *dev) {
> > +	unsigned i;
> > +	struct pmd_internals *internal = dev->data->dev_private;
> > +
> > +	for (i = 0; i < internal->nb_queues; i++)
> > +		internal->rx_queue[i].rx_pkts = 0;
> > +
> > +	for (i = 0; i < internal->nb_queues; i++) {
> > +		internal->tx_queue[i].tx_pkts = 0;
> > +		internal->tx_queue[i].err_pkts = 0;
> > +	}
> > +}
> > +
> > +static void
> > +eth_dev_close(struct rte_eth_dev *dev __rte_unused) { }
> > +
> > +static void
> > +eth_queue_release(void *q __rte_unused) { }
> > +
> > +static int
> > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > +                int wait_to_complete __rte_unused) {
> > +	return 0;
> > +}
> > +
> > +static int
> > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > +                   uint16_t rx_queue_id,
> > +                   uint16_t nb_rx_desc __rte_unused,
> > +                   unsigned int socket_id __rte_unused,
> > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > +                   struct rte_mempool *mb_pool) {
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > +	uint16_t buf_size;
> > +
> > +	pkt_q->mb_pool = mb_pool;
> > +
> > +	/* Now get the space available for data in the mbuf */
> > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > +	                       RTE_PKTMBUF_HEADROOM);
> > +
> > +	if (ETH_FRAME_LEN > buf_size) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > +                   uint16_t tx_queue_id,
> > +                   uint16_t nb_tx_desc __rte_unused,
> > +                   unsigned int socket_id __rte_unused,
> > +                   const struct rte_eth_txconf *tx_conf __rte_unused) {
> > +
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > +	return 0;
> > +}
> > +
> > +static struct eth_dev_ops ops = {
> > +	.dev_start = eth_dev_start,
> > +	.dev_stop = eth_dev_stop,
> > +	.dev_close = eth_dev_close,
> > +	.dev_configure = eth_dev_configure,
> > +	.dev_infos_get = eth_dev_info,
> > +	.rx_queue_setup = eth_rx_queue_setup,
> > +	.tx_queue_setup = eth_tx_queue_setup,
> > +	.rx_queue_release = eth_queue_release,
> > +	.tx_queue_release = eth_queue_release,
> > +	.link_update = eth_link_update,
> > +	.stats_get = eth_stats_get,
> > +	.stats_reset = eth_stats_reset,
> > +};
> > +
> > +/*
> > + * Opens an AF_PACKET socket
> > + */
> > +static int
> > +open_packet_iface(const char *key __rte_unused,
> > +                  const char *value __rte_unused,
> > +                  void *extra_args)
> > +{
> > +	int *sockfd = extra_args;
> > +
> > +	/* Open an AF_PACKET socket... */
> > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > +	if (*sockfd == -1) {
> > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > +		return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +rte_pmd_init_internals(const char *name,
> > +                       const int sockfd,
> > +                       const unsigned nb_queues,
> > +                       unsigned int blocksize,
> > +                       unsigned int blockcnt,
> > +                       unsigned int framesize,
> > +                       unsigned int framecnt,
> > +                       const unsigned numa_node,
> > +                       struct pmd_internals **internals,
> > +                       struct rte_eth_dev **eth_dev,
> > +                       struct rte_kvargs *kvlist) {
> > +	struct rte_eth_dev_data *data = NULL;
> > +	struct rte_pci_device *pci_dev = NULL;
> > +	struct rte_kvargs_pair *pair = NULL;
> > +	struct ifreq ifr;
> > +	size_t ifnamelen;
> > +	unsigned k_idx;
> > +	struct sockaddr_ll sockaddr;
> > +	struct tpacket_req *req;
> > +	struct pkt_rx_queue *rx_queue;
> > +	struct pkt_tx_queue *tx_queue;
> > +	int rc, tpver, discard, bypass;
> > +	unsigned int i, q, rdsize;
> > +	int qsockfd, fanout_arg;
> > +
> > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > +		pair = &kvlist->pairs[k_idx];
> > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > +			break;
> > +	}
> > +	if (pair == NULL) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +
> > +	RTE_LOG(INFO, PMD,
> > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > +		name, numa_node);
> > +
> > +	/*
> > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > +	 * and internal (private) data
> > +	 */
> > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > +	if (data == NULL)
> > +		goto error;
> > +
> > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > +	if (pci_dev == NULL)
> > +		goto error;
> > +
> > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > +	                                0, numa_node);
> > +	if (*internals == NULL)
> > +		goto error;
> > +
> > +	req = &((*internals)->req);
> > +
> > +	req->tp_block_size = blocksize;
> > +	req->tp_block_nr = blockcnt;
> > +	req->tp_frame_size = framesize;
> > +	req->tp_frame_nr = framecnt;
> > +
> > +	ifnamelen = strlen(pair->value);
> > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > +		ifr.ifr_name[ifnamelen] = '\0';
> > +	} else {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: I/F name too long (%s)\n",
> > +			name, pair->value);
> > +		goto error;
> > +	}
> > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +	(*internals)->if_index = ifr.ifr_ifindex;
> > +
> > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > +
> > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > +	sockaddr.sll_family = AF_PACKET;
> > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > +
> > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > +
> > +	for (q = 0; q < nb_queues; q++) {
> > +		/* Open an AF_PACKET socket for this queue... */
> > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > +		if (qsockfd == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +			        "%s: could not open AF_PACKET socket\n",
> > +			        name);
> > +			return -1;
> > +		}
> > +
> > +		tpver = TPACKET_V2;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > +				&tpver, sizeof(tpver));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		discard = 1;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > +				&discard, sizeof(discard));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_LOSS on "
> > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		bypass = 1;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > +				&bypass, sizeof(bypass));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_QDISC_BYPASS "
> > +			        "on AF_PACKET socket for %s\n", name,
> > +			        pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req,
> > +				sizeof(*req));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req,
> > +				sizeof(*req));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rx_queue = &((*internals)->rx_queue[q]);
> > +		rx_queue->framecount = req->tp_frame_nr;
> > +
> > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > +				    qsockfd, 0);
> > +		if (rx_queue->map == MAP_FAILED) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > +				name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		/* rdsize is same for both Tx and Rx */
> > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > +
> > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > +		}
> > +		rx_queue->sockfd = qsockfd;
> > +
> > +		tx_queue = &((*internals)->tx_queue[q]);
> > +		tx_queue->framecount = req->tp_frame_nr;
> > +
> > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > +
> > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > +		}
> > +		tx_queue->sockfd = qsockfd;
> > +
> > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not bind AF_PACKET socket to %s\n",
> > +			        name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > +				&fanout_arg, sizeof(fanout_arg));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > +				"for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +	}
> > +
> > +	/* reserve an ethdev entry */
> > +	*eth_dev = rte_eth_dev_allocate(name);
> > +	if (*eth_dev == NULL)
> > +		goto error;
> > +
> > +	/*
> > +	 * now put it all together
> > +	 * - store queue data in internals,
> > +	 * - store numa_node info in pci_driver
> > +	 * - point eth_dev_data to internals and pci_driver
> > +	 * - and point eth_dev structure to new eth_dev_data structure
> > +	 */
> > +
> > +	(*internals)->nb_queues = nb_queues;
> > +
> > +	data->dev_private = *internals;
> > +	data->port_id = (*eth_dev)->data->port_id;
> > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > +	data->dev_link = pmd_link;
> > +	data->mac_addrs = &(*internals)->eth_addr;
> > +
> > +	pci_dev->numa_node = numa_node;
> > +
> > +	(*eth_dev)->data = data;
> > +	(*eth_dev)->dev_ops = &ops;
> > +	(*eth_dev)->pci_dev = pci_dev;
> > +
> > +	return 0;
> > +
> > +error:
> > +	if (data)
> > +		rte_free(data);
> > +	if (pci_dev)
> > +		rte_free(pci_dev);
> > +	for (q = 0; q < nb_queues; q++) {
> > +		if ((*internals)->rx_queue[q].rd)
> > +			rte_free((*internals)->rx_queue[q].rd);
> > +		if ((*internals)->tx_queue[q].rd)
> > +			rte_free((*internals)->tx_queue[q].rd);
> > +	}
> > +	if (*internals)
> > +		rte_free(*internals);
> > +	return -1;
> > +}
> > +
> > +static int
> > +rte_eth_from_packet(const char *name,
> > +                    int const *sockfd,
> > +                    const unsigned numa_node,
> > +                    struct rte_kvargs *kvlist) {
> > +	struct pmd_internals *internals = NULL;
> > +	struct rte_eth_dev *eth_dev = NULL;
> > +	struct rte_kvargs_pair *pair = NULL;
> > +	unsigned k_idx;
> > +	unsigned int blockcount;
> > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > +	unsigned int qpairs = 1;
> > +
> > +	/* do some parameter checking */
> > +	if (*sockfd < 0)
> > +		return -1;
> > +
> > +	/*
> > +	 * Walk arguments for configurable settings
> > +	 */
> > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > +		pair = &kvlist->pairs[k_idx];
> > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > +			qpairs = atoi(pair->value);
> > +			if (qpairs < 1 ||
> > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid qpairs value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > +			blocksize = atoi(pair->value);
> > +			if (!blocksize) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid blocksize value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > +			framesize = atoi(pair->value);
> > +			if (!framesize) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid framesize value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > +			framecount = atoi(pair->value);
> > +			if (!framecount) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid framecount value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +	}
> > +
> > +	if (framesize > blocksize) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > +		        name);
> > +		return -1;
> > +	}
> > +
> > +	blockcount = framecount / (blocksize / framesize);
> > +	if (!blockcount) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > +		return -1;
> > +	}
> > +
> > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > +
> > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > +	                           blocksize, blockcount,
> > +	                           framesize, framecount,
> > +	                           numa_node, &internals, &eth_dev,
> > +	                           kvlist) < 0)
> > +		return -1;
> > +
> > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > +
> > +	return 0;
> > +}
> > +
> > +int
> > +rte_pmd_packet_devinit(const char *name, const char *params) {
> > +	unsigned numa_node;
> > +	int ret;
> > +	struct rte_kvargs *kvlist;
> > +	int sockfd = -1;
> > +
> > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > +
> > +	numa_node = rte_socket_id();
> > +
> > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > +	if (kvlist == NULL)
> > +		return -1;
> > +
> > +	/*
> > +	 * If iface argument is passed we open the NICs and use them for
> > +	 * reading / writing
> > +	 */
> > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > +
> > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > +		                         &open_packet_iface, &sockfd);
> > +		if (ret < 0)
> > +			return -1;
> > +	}
> > +
> > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > +	close(sockfd); /* no longer needed */
> > +
> > +	if (ret < 0)
> > +		return -1;
> > +
> > +	return 0;
> > +}
> > +
> > +static struct rte_driver pmd_packet_drv = {
> > +	.name = "eth_packet",
> > +	.type = PMD_VDEV,
> > +	.init = rte_pmd_packet_devinit,
> > +};
> > +
> > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > new file mode 100644
> > index 000000000000..f685611da3e9
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > @@ -0,0 +1,55 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#ifndef _RTE_ETH_PACKET_H_
> > +#define _RTE_ETH_PACKET_H_
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > +
> > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > +
> > +/**
> > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > + * configured on command line.
> > + */
> > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif
> > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > index 34dff2a02a05..a6994c4dbe93 100644
> > --- a/mk/rte.app.mk
> > +++ b/mk/rte.app.mk
> > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> >  LDLIBS += -lrte_pmd_pcap -lpcap
> >  endif
> > 
> > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > +LDLIBS += -lrte_pmd_packet
> > +endif
> > +
> >  endif # plugins
> > 
> >  LDLIBS += $(EXECENV_LDLIBS)
> > --
> > 1.9.3
> 
>
  
John W. Linville July 15, 2014, 2:01 p.m. UTC | #3
On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > According to my performance measurements with 64B packets, 1 queue
> > performs better than 16 queues (1.35M pps vs. 0.93M pps). That makes
> > sense to me: in the 16-queue case the NAPI-enabled ixgbe driver needs
> > more CPU cycles in kernel land (87% for 16 queues vs. 80% for 1 queue)
> > to switch between polling and interrupt modes in order to service
> > per-queue rx interrupts, so more context switch overhead is involved.
> > Also, since the eth_packet_rx/eth_packet_tx routines involve two
> > memory copies between DPDK mbuf and pbuf for each packet, it can
> > hardly achieve high performance unless packets are DMA'd directly
> > into mbufs, which would need support in the ixgbe driver.
> 
> I thought 16 queues would be spread out between as many cpus as you had though,
> obviating the need for context switches, no?

I think Danny is testing the single CPU case.  Having more queues
than CPUs probably does not provide any benefit.
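
For context on how traffic is spread across the queues: each queue's
socket joins one PACKET_FANOUT group, and the kernel hashes flows across
the group's members.  A minimal sketch of how the patch composes the
fanout argument follows.  The helper name fanout_arg_build() is
hypothetical and not part of the patch; the constants come from
<linux/if_packet.h>.

```c
#include <stdint.h>
#include <sys/types.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper mirroring the patch's fanout_arg computation:
 * the low 16 bits carry the group ID (pid XOR if_index), the high
 * 16 bits carry the fanout type and flags.
 */
static uint32_t
fanout_arg_build(pid_t pid, int if_index)
{
	uint32_t group_id = ((uint32_t)pid ^ (uint32_t)if_index) & 0xffff;
	uint32_t type_flags = PACKET_FANOUT_HASH |
			      PACKET_FANOUT_FLAG_DEFRAG |
			      PACKET_FANOUT_FLAG_ROLLOVER;

	/* unsigned arithmetic avoids shifting into the sign bit */
	return group_id | (type_flags << 16);
}
```

The result would be passed to setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
...) once per queue socket, as the patch does.  With only 16 bits of
group ID, two processes on the same interface can still collide.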

It would be cool to hack the DPDK memory management to work directly
out of the mmap'ed AF_PACKET buffers.  But at this point I don't
have enough knowledge of DPDK internals to know if that is at all
reasonable...
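
For reference, the size of those mmap'ed buffers follows directly from
the blocksz/framesz/framecnt arguments.  The sketch below mirrors the
arithmetic in the patch; struct ring_geometry and ring_geometry_init()
are hypothetical names used only for illustration, and the zero-size
check is an extra guard that the patch does not have in this form.

```c
#include <stdint.h>

/* Hypothetical container for the derived AF_PACKET ring layout. */
struct ring_geometry {
	unsigned int frames_per_block;
	unsigned int blockcount;
	uint64_t ring_bytes;	/* one ring (Rx or Tx) */
	uint64_t map_bytes;	/* Rx + Tx, the length passed to mmap() */
};

/* Returns 0 on success, -1 if the parameters would be rejected. */
static int
ring_geometry_init(struct ring_geometry *g, unsigned int blocksize,
		   unsigned int framesize, unsigned int framecount)
{
	if (blocksize == 0 || framesize == 0 || framesize > blocksize)
		return -1;

	g->frames_per_block = blocksize / framesize;
	g->blockcount = framecount / g->frames_per_block;
	if (g->blockcount == 0)
		return -1;

	g->ring_bytes = (uint64_t)blocksize * g->blockcount;
	g->map_bytes = 2 * g->ring_bytes;
	return 0;
}
```

With the defaults (4096, 2048, 512) this gives 2 frames per block,
256 blocks, and a 2 MiB mapping per queue pair.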

John

P.S.  Danny, have you run any performance tests on the PCAP driver?
  
Zhou, Danny July 15, 2014, 3:34 p.m. UTC | #4
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Tuesday, July 15, 2014 8:18 PM
> To: Zhou, Danny
> Cc: John W. Linville; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > According to my performance measurements with 64B packets, 1 queue
> > performs better than 16 queues (1.35M pps vs. 0.93M pps). That makes
> > sense to me: in the 16-queue case the NAPI-enabled ixgbe driver needs
> > more CPU cycles in kernel land (87% for 16 queues vs. 80% for 1 queue)
> > to switch between polling and interrupt modes in order to service
> > per-queue rx interrupts, so more context switch overhead is involved.
> > Also, since the eth_packet_rx/eth_packet_tx routines involve two
> > memory copies between DPDK mbuf and pbuf for each packet, it can
> > hardly achieve high performance unless packets are DMA'd directly
> > into mbufs, which would need support in the ixgbe driver.
> 
> I thought 16 queues would be spread out between as many cpus as you had though,
> obviating the need for context switches, no?
> Neil
> 

If you set the per-queue MSI-X interrupt affinities to different CPUs, performance would be much better
and linear scaling is expected. But for an apples-to-apples comparison against the 1-queue case
on a single core, all interrupts are by default handled by one core, say core0, so I think the many
context switches hurt performance.
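
As a rough illustration of the affinity setting mentioned above: on
Linux, an interrupt is steered by writing a hexadecimal CPU bitmask to
/proc/irq/<irq>/smp_affinity.  The sketch below only builds that mask
string.  The helper names are hypothetical, it handles CPUs 0-31 only,
and the IRQ numbers of the ixgbe queues are system-specific and not
shown here.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: bitmask selecting one CPU, or 0 if cpu >= 32. */
static uint32_t
cpu_to_irq_mask(unsigned int cpu)
{
	return cpu < 32 ? UINT32_C(1) << cpu : 0;
}

/*
 * Hypothetical helper: format the mask as the hex string expected by
 * /proc/irq/<irq>/smp_affinity.  Returns the string length, or -1 if
 * the CPU number is out of range for this sketch.
 */
static int
format_smp_affinity(unsigned int cpu, char *buf, size_t len)
{
	uint32_t mask = cpu_to_irq_mask(cpu);

	if (mask == 0)
		return -1;
	return snprintf(buf, len, "%x", (unsigned int)mask);
}
```

Giving each queue's interrupt its own single-CPU mask is the setup
under which linear scaling would be expected; leaving them all on one
core corresponds to the single-core measurement above.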

> >
> > > -----Original Message-----
> > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > Sent: Tuesday, July 15, 2014 2:25 AM
> > > To: dev@dpdk.org
> > > Cc: Thomas Monjalon; Richardson, Bruce; Zhou, Danny
> > > Subject: [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based
> > > virtual devices
> > >
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > socket.  This implementation uses mmap'ed ring buffers to limit
> > > copying and user/kernel transitions.  The PACKET_FANOUT_HASH
> > > behavior of AF_PACKET is used for frame reception.  In the current
> > > implementation, Tx and Rx queues are always paired, and therefore
> > > > are always equal in number -- changing this would be a Simple Matter
> > > > Of Programming.
> > >
> > > Interfaces of this type are created with a command line option like
> > > "--vdev=eth_packet0,iface=...".  There are a number of options
> > > > available as
> > > arguments:
> > >
> > >  - Interface is chosen by "iface" (required)
> > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > >
> > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > ---
> > > This PMD is intended to provide a means for using DPDK on a broad
> > > range of hardware without hardware-specific PMDs and (hopefully)
> > > with better performance than what PCAP offers in Linux.  This might
> > > > be useful as a development platform for DPDK applications when
> > > > DPDK-supported hardware is expensive or unavailable.
> > >
> > > New in v2:
> > >
> > > -- fixup some style issues found by check patch
> > > -- use if_index as part of fanout group ID
> > > -- set default number of queue pairs to 1
> > >
> > >  config/common_bsdapp                   |   5 +
> > >  config/common_linuxapp                 |   5 +
> > >  lib/Makefile                           |   1 +
> > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > >  mk/rte.app.mk                          |   4 +
> > > >  8 files changed, 957 insertions(+)
> > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > >
> > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > index 943dce8f1ede..c317f031278e 100644
> > > > --- a/config/common_bsdapp
> > > > +++ b/config/common_bsdapp
> > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > >
> > > >  #
> > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > +#
> > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > +
> > > > +#
> > > >  # Do prefetch of packet data within PMD driver receive function
> > > >  #
> > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > --- a/config/common_linuxapp
> > > > +++ b/config/common_linuxapp
> > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > >
> > > >  #
> > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > +#
> > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > +
> > > > +#
> > > >  # Compile Xen PMD
> > > >  #
> > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > index 10c5bb3045bc..930fadf29898 100644
> > > > --- a/lib/Makefile
> > > > +++ b/lib/Makefile
> > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > > index 756d6b0c9301..feed24a63272 100644
> > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > >
> > > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > > new file mode 100644
> > > > index 000000000000..e1266fb992cd
> > > > --- /dev/null
> > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > @@ -0,0 +1,60 @@
> > > +#   BSD LICENSE
> > > +#
> > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > +#   Copyright(c) 2014 6WIND S.A.
> > > +#   All rights reserved.
> > > +#
> > > +#   Redistribution and use in source and binary forms, with or without
> > > +#   modification, are permitted provided that the following conditions
> > > +#   are met:
> > > +#
> > > +#     * Redistributions of source code must retain the above copyright
> > > +#       notice, this list of conditions and the following disclaimer.
> > > +#     * Redistributions in binary form must reproduce the above copyright
> > > +#       notice, this list of conditions and the following disclaimer in
> > > +#       the documentation and/or other materials provided with the
> > > +#       distribution.
> > > +#     * Neither the name of Intel Corporation nor the names of its
> > > +#       contributors may be used to endorse or promote products derived
> > > +#       from this software without specific prior written permission.
> > > +#
> > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > +
> > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > +
> > > +#
> > > +# library name
> > > +#
> > > +LIB = librte_pmd_packet.a
> > > +
> > > +CFLAGS += -O3
> > > +CFLAGS += $(WERROR_FLAGS)
> > > +
> > > +#
> > > +# all source are stored in SRCS-y
> > > +#
> > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > +
> > > +#
> > > +# Export include files
> > > +#
> > > +SYMLINK-y-include += rte_eth_packet.h
> > > +
> > > +# this lib depends upon:
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > +
> > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > new file mode 100644
> > > index 000000000000..9c82d16e730f
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > @@ -0,0 +1,826 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > + *
> > > + *   Originally based upon librte_pmd_pcap code:
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   Copyright(c) 2014 6WIND S.A.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > + */
> > > +
> > > +#include <rte_mbuf.h>
> > > +#include <rte_ethdev.h>
> > > +#include <rte_malloc.h>
> > > +#include <rte_kvargs.h>
> > > +#include <rte_dev.h>
> > > +
> > > +#include <linux/if_ether.h>
> > > +#include <linux/if_packet.h>
> > > +#include <arpa/inet.h>
> > > +#include <net/if.h>
> > > +#include <sys/types.h>
> > > +#include <sys/socket.h>
> > > +#include <sys/ioctl.h>
> > > +#include <sys/mman.h>
> > > +#include <unistd.h>
> > > +#include <poll.h>
> > > +
> > > +#include "rte_eth_packet.h"
> > > +
> > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > +
> > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > +
> > > +struct pkt_rx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	struct rte_mempool *mb_pool;
> > > +
> > > +	volatile unsigned long rx_pkts;
> > > +	volatile unsigned long err_pkts;
> > > +};
> > > +
> > > +struct pkt_tx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	volatile unsigned long tx_pkts;
> > > +	volatile unsigned long err_pkts;
> > > +};
> > > +
> > > +struct pmd_internals {
> > > +	unsigned nb_queues;
> > > +
> > > +	int if_index;
> > > +	struct ether_addr eth_addr;
> > > +
> > > +	struct tpacket_req req;
> > > +
> > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > +};
> > > +
> > > +static const char *valid_arguments[] = {
> > > +	ETH_PACKET_IFACE_ARG,
> > > +	ETH_PACKET_NUM_Q_ARG,
> > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > +	NULL
> > > +};
> > > +
> > > +static const char *drivername = "AF_PACKET PMD";
> > > +
> > > +static struct rte_eth_link pmd_link = {
> > > +	.link_speed = 10000,
> > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > +	.link_status = 0
> > > +};
> > > +
> > > +static uint16_t
> > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> > > +	unsigned i;
> > > +	struct tpacket2_hdr *ppd;
> > > +	struct rte_mbuf *mbuf;
> > > +	uint8_t *pbuf;
> > > +	struct pkt_rx_queue *pkt_q = queue;
> > > +	uint16_t num_rx = 0;
> > > +	unsigned int framecount, framenum;
> > > +
> > > +	if (unlikely(nb_pkts == 0))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > +	 */
> > > +	framecount = pkt_q->framecount;
> > > +	framenum = pkt_q->framenum;
> > > +	for (i = 0; i < nb_pkts; i++) {
> > > +		/* point at the next incoming frame */
> > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > +			break;
> > > +
> > > +		/* allocate the next mbuf */
> > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > +		if (unlikely(mbuf == NULL))
> > > +			break;
> > > +
> > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > +
> > > +		/* release incoming frame and advance ring buffer */
> > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > +		if (++framenum >= framecount)
> > > +			framenum = 0;
> > > +
> > > +		/* account for the receive frame */
> > > +		bufs[i] = mbuf;
> > > +		num_rx++;
> > > +	}
> > > +	pkt_q->framenum = framenum;
> > > +	pkt_q->rx_pkts += num_rx;
> > > +	return num_rx;
> > > +}
> > > +
> > > +/*
> > > + * Callback to handle sending packets through a real NIC.
> > > + */
> > > +static uint16_t
> > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) {
> > > +	struct tpacket2_hdr *ppd;
> > > +	struct rte_mbuf *mbuf;
> > > +	uint8_t *pbuf;
> > > +	unsigned int framecount, framenum;
> > > +	struct pollfd pfd;
> > > +	struct pkt_tx_queue *pkt_q = queue;
> > > +	uint16_t num_tx = 0;
> > > +	int i;
> > > +
> > > +	if (unlikely(nb_pkts == 0))
> > > +		return 0;
> > > +
> > > +	memset(&pfd, 0, sizeof(pfd));
> > > +	pfd.fd = pkt_q->sockfd;
> > > +	pfd.events = POLLOUT;
> > > +	pfd.revents = 0;
> > > +
> > > +	framecount = pkt_q->framecount;
> > > +	framenum = pkt_q->framenum;
> > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +	for (i = 0; i < nb_pkts; i++) {
> > > +		/* point at the next incoming frame */
> > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > +		    (poll(&pfd, 1, -1) < 0))
> > > +				continue;
> > > +
> > > +		/* copy the tx frame data */
> > > +		mbuf = bufs[num_tx];
> > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > +			sizeof(struct sockaddr_ll);
> > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > +
> > > +		/* release incoming frame and advance ring buffer */
> > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > +		if (++framenum >= framecount)
> > > +			framenum = 0;
> > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +
> > > +		num_tx++;
> > > +		rte_pktmbuf_free(mbuf);
> > > +	}
> > > +
> > > +	/* kick-off transmits */
> > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > +
> > > +	pkt_q->framenum = framenum;
> > > +	pkt_q->tx_pkts += num_tx;
> > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > +	return num_tx;
> > > +}
> > > +
> > > +static int
> > > +eth_dev_start(struct rte_eth_dev *dev) {
> > > +	dev->data->dev_link.link_status = 1;
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * This function gets called when the current port gets stopped.
> > > + */
> > > +static void
> > > +eth_dev_stop(struct rte_eth_dev *dev) {
> > > +	unsigned i;
> > > +	int sockfd;
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > +		sockfd = internals->rx_queue[i].sockfd;
> > > +		if (sockfd != -1)
> > > +			close(sockfd);
> > > +		sockfd = internals->tx_queue[i].sockfd;
> > > +		if (sockfd != -1)
> > > +			close(sockfd);
> > > +	}
> > > +
> > > +	dev->data->dev_link.link_status = 0; }
> > > +
> > > +static int
> > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused) {
> > > +	return 0;
> > > +}
> > > +
> > > +static void
> > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) {
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	dev_info->driver_name = drivername;
> > > +	dev_info->if_index = internals->if_index;
> > > +	dev_info->max_mac_addrs = 1;
> > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > +	dev_info->min_rx_bufsize = 0;
> > > +	dev_info->pci_dev = NULL;
> > > +}
> > > +
> > > +static void
> > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats) {
> > > +	unsigned i, imax;
> > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > +
> > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > +
> > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > +	for (i = 0; i < imax; i++) {
> > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > +		rx_total += igb_stats->q_ipackets[i];
> > > +	}
> > > +
> > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > +	for (i = 0; i < imax; i++) {
> > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > +		tx_total += igb_stats->q_opackets[i];
> > > +		tx_err_total += igb_stats->q_errors[i];
> > > +	}
> > > +
> > > +	igb_stats->ipackets = rx_total;
> > > +	igb_stats->opackets = tx_total;
> > > +	igb_stats->oerrors = tx_err_total;
> > > +}
> > > +
> > > +static void
> > > +eth_stats_reset(struct rte_eth_dev *dev) {
> > > +	unsigned i;
> > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > +
> > > +	for (i = 0; i < internal->nb_queues; i++)
> > > +		internal->rx_queue[i].rx_pkts = 0;
> > > +
> > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > +		internal->tx_queue[i].tx_pkts = 0;
> > > +		internal->tx_queue[i].err_pkts = 0;
> > > +	}
> > > +}
> > > +
> > > +static void
> > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused) { }
> > > +
> > > +static void
> > > +eth_queue_release(void *q __rte_unused) { }
> > > +
> > > +static int
> > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > +                int wait_to_complete __rte_unused) {
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > +                   uint16_t rx_queue_id,
> > > +                   uint16_t nb_rx_desc __rte_unused,
> > > +                   unsigned int socket_id __rte_unused,
> > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > +                   struct rte_mempool *mb_pool) {
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > +	uint16_t buf_size;
> > > +
> > > +	pkt_q->mb_pool = mb_pool;
> > > +
> > > +	/* Now get the space available for data in the mbuf */
> > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > +	                       RTE_PKTMBUF_HEADROOM);
> > > +
> > > +	if (ETH_FRAME_LEN > buf_size) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > +                   uint16_t tx_queue_id,
> > > +                   uint16_t nb_tx_desc __rte_unused,
> > > +                   unsigned int socket_id __rte_unused,
> > > +                   const struct rte_eth_txconf *tx_conf __rte_unused) {
> > > +
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > +	return 0;
> > > +}
> > > +
> > > +static struct eth_dev_ops ops = {
> > > +	.dev_start = eth_dev_start,
> > > +	.dev_stop = eth_dev_stop,
> > > +	.dev_close = eth_dev_close,
> > > +	.dev_configure = eth_dev_configure,
> > > +	.dev_infos_get = eth_dev_info,
> > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > +	.rx_queue_release = eth_queue_release,
> > > +	.tx_queue_release = eth_queue_release,
> > > +	.link_update = eth_link_update,
> > > +	.stats_get = eth_stats_get,
> > > +	.stats_reset = eth_stats_reset,
> > > +};
> > > +
> > > +/*
> > > + * Opens an AF_PACKET socket
> > > + */
> > > +static int
> > > +open_packet_iface(const char *key __rte_unused,
> > > +                  const char *value __rte_unused,
> > > +                  void *extra_args) {
> > > +	int *sockfd = extra_args;
> > > +
> > > +	/* Open an AF_PACKET socket... */
> > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > +	if (*sockfd == -1) {
> > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > +		return -1;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +rte_pmd_init_internals(const char *name,
> > > +                       const int sockfd,
> > > +                       const unsigned nb_queues,
> > > +                       unsigned int blocksize,
> > > +                       unsigned int blockcnt,
> > > +                       unsigned int framesize,
> > > +                       unsigned int framecnt,
> > > +                       const unsigned numa_node,
> > > +                       struct pmd_internals **internals,
> > > +                       struct rte_eth_dev **eth_dev,
> > > +                       struct rte_kvargs *kvlist) {
> > > +	struct rte_eth_dev_data *data = NULL;
> > > +	struct rte_pci_device *pci_dev = NULL;
> > > +	struct rte_kvargs_pair *pair = NULL;
> > > +	struct ifreq ifr;
> > > +	size_t ifnamelen;
> > > +	unsigned k_idx;
> > > +	struct sockaddr_ll sockaddr;
> > > +	struct tpacket_req *req;
> > > +	struct pkt_rx_queue *rx_queue;
> > > +	struct pkt_tx_queue *tx_queue;
> > > +	int rc, tpver, discard, bypass;
> > > +	unsigned int i, q, rdsize;
> > > +	int qsockfd, fanout_arg;
> > > +
> > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > +		pair = &kvlist->pairs[k_idx];
> > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > +			break;
> > > +	}
> > > +	if (pair == NULL) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +
> > > +	RTE_LOG(INFO, PMD,
> > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > +		name, numa_node);
> > > +
> > > +	/*
> > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > +	 * and internal (private) data
> > > +	 */
> > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > +	if (data == NULL)
> > > +		goto error;
> > > +
> > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > +	if (pci_dev == NULL)
> > > +		goto error;
> > > +
> > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > +	                                0, numa_node);
> > > +	if (*internals == NULL)
> > > +		goto error;
> > > +
> > > +	req = &((*internals)->req);
> > > +
> > > +	req->tp_block_size = blocksize;
> > > +	req->tp_block_nr = blockcnt;
> > > +	req->tp_frame_size = framesize;
> > > +	req->tp_frame_nr = framecnt;
> > > +
> > > +	ifnamelen = strlen(pair->value);
> > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > +	} else {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: I/F name too long (%s)\n",
> > > +			name, pair->value);
> > > +		goto error;
> > > +	}
> > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > +
> > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > +
> > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > +	sockaddr.sll_family = AF_PACKET;
> > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > +
> > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > +
> > > +	for (q = 0; q < nb_queues; q++) {
> > > +		/* Open an AF_PACKET socket for this queue... */
> > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > +		if (qsockfd == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +			        "%s: could not open AF_PACKET socket\n",
> > > +			        name);
> > > +			return -1;
> > > +		}
> > > +
> > > +		tpver = TPACKET_V2;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > +				&tpver, sizeof(tpver));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		discard = 1;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > +				&discard, sizeof(discard));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_LOSS on "
> > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		bypass = 1;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > +				&bypass, sizeof(bypass));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > +			        "on AF_PACKET socket for %s\n", name,
> > > +			        pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > +		rx_queue->framecount = req->tp_frame_nr;
> > > +
> > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > +				    qsockfd, 0);
> > > +		if (rx_queue->map == MAP_FAILED) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > +				name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		/* rdsize is same for both Tx and Rx */
> > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > +
> > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > +		}
> > > +		rx_queue->sockfd = qsockfd;
> > > +
> > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > +		tx_queue->framecount = req->tp_frame_nr;
> > > +
> > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > +
> > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > +		}
> > > +		tx_queue->sockfd = qsockfd;
> > > +
> > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > +			        name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > +				&fanout_arg, sizeof(fanout_arg));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > +				"for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +	}
> > > +
> > > +	/* reserve an ethdev entry */
> > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > +	if (*eth_dev == NULL)
> > > +		goto error;
> > > +
> > > +	/*
> > > +	 * now put it all together
> > > +	 * - store queue data in internals,
> > > +	 * - store numa_node info in pci_driver
> > > +	 * - point eth_dev_data to internals and pci_driver
> > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > +	 */
> > > +
> > > +	(*internals)->nb_queues = nb_queues;
> > > +
> > > +	data->dev_private = *internals;
> > > +	data->port_id = (*eth_dev)->data->port_id;
> > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > +	data->dev_link = pmd_link;
> > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > +
> > > +	pci_dev->numa_node = numa_node;
> > > +
> > > +	(*eth_dev)->data = data;
> > > +	(*eth_dev)->dev_ops = &ops;
> > > +	(*eth_dev)->pci_dev = pci_dev;
> > > +
> > > +	return 0;
> > > +
> > > +error:
> > > +	if (data)
> > > +		rte_free(data);
> > > +	if (pci_dev)
> > > +		rte_free(pci_dev);
> > > +	for (q = 0; q < nb_queues; q++) {
> > > +		if ((*internals)->rx_queue[q].rd)
> > > +			rte_free((*internals)->rx_queue[q].rd);
> > > +		if ((*internals)->tx_queue[q].rd)
> > > +			rte_free((*internals)->tx_queue[q].rd);
> > > +	}
> > > +	if (*internals)
> > > +		rte_free(*internals);
> > > +	return -1;
> > > +}
> > > +
> > > +static int
> > > +rte_eth_from_packet(const char *name,
> > > +                    int const *sockfd,
> > > +                    const unsigned numa_node,
> > > +                    struct rte_kvargs *kvlist) {
> > > +	struct pmd_internals *internals = NULL;
> > > +	struct rte_eth_dev *eth_dev = NULL;
> > > +	struct rte_kvargs_pair *pair = NULL;
> > > +	unsigned k_idx;
> > > +	unsigned int blockcount;
> > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > +	unsigned int qpairs = 1;
> > > +
> > > +	/* do some parameter checking */
> > > +	if (*sockfd < 0)
> > > +		return -1;
> > > +
> > > +	/*
> > > +	 * Walk arguments for configurable settings
> > > +	 */
> > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > +		pair = &kvlist->pairs[k_idx];
> > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > +			qpairs = atoi(pair->value);
> > > +			if (qpairs < 1 ||
> > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid qpairs value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > +			blocksize = atoi(pair->value);
> > > +			if (!blocksize) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid blocksize value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > +			framesize = atoi(pair->value);
> > > +			if (!framesize) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid framesize value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > +			framecount = atoi(pair->value);
> > > +			if (!framecount) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid framecount value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +	}
> > > +
> > > +	if (framesize > blocksize) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > +		        name);
> > > +		return -1;
> > > +	}
> > > +
> > > +	blockcount = framecount / (blocksize / framesize);
> > > +	if (!blockcount) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > +		return -1;
> > > +	}
> > > +
> > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > +
> > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > +	                           blocksize, blockcount,
> > > +	                           framesize, framecount,
> > > +	                           numa_node, &internals, &eth_dev,
> > > +	                           kvlist) < 0)
> > > +		return -1;
> > > +
> > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int
> > > +rte_pmd_packet_devinit(const char *name, const char *params) {
> > > +	unsigned numa_node;
> > > +	int ret;
> > > +	struct rte_kvargs *kvlist;
> > > +	int sockfd = -1;
> > > +
> > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > +
> > > +	numa_node = rte_socket_id();
> > > +
> > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > +	if (kvlist == NULL)
> > > +		return -1;
> > > +
> > > +	/*
> > > +	 * If iface argument is passed we open the NICs and use them for
> > > +	 * reading / writing
> > > +	 */
> > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > +
> > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > +		                         &open_packet_iface, &sockfd);
> > > +		if (ret < 0)
> > > +			return -1;
> > > +	}
> > > +
> > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > +	close(sockfd); /* no longer needed */
> > > +
> > > +	if (ret < 0)
> > > +		return -1;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static struct rte_driver pmd_packet_drv = {
> > > +	.name = "eth_packet",
> > > +	.type = PMD_VDEV,
> > > +	.init = rte_pmd_packet_devinit,
> > > +};
> > > +
> > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > new file mode 100644
> > > index 000000000000..f685611da3e9
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > @@ -0,0 +1,55 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > + */
> > > +
> > > +#ifndef _RTE_ETH_PACKET_H_
> > > +#define _RTE_ETH_PACKET_H_
> > > +
> > > +#ifdef __cplusplus
> > > +extern "C" {
> > > +#endif
> > > +
> > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > +
> > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > +
> > > +/**
> > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > + * configured on command line.
> > > + */
> > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > +
> > > +#ifdef __cplusplus
> > > +}
> > > +#endif
> > > +
> > > +#endif
> > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > index 34dff2a02a05..a6994c4dbe93 100644
> > > --- a/mk/rte.app.mk
> > > +++ b/mk/rte.app.mk
> > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > >  endif
> > >
> > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > +LDLIBS += -lrte_pmd_packet
> > > +endif
> > > +
> > >  endif # plugins
> > >
> > >  LDLIBS += $(EXECENV_LDLIBS)
> > > --
> > > 1.9.3
> >
> >
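
For readers following the fanout setup in rte_pmd_init_internals() above: the PACKET_FANOUT option takes one 32-bit value, with the group ID in the low 16 bits and the fanout type plus flags in the high 16 bits. Below is a minimal sketch of that packing. The SKETCH_* constants and make_fanout_arg() are illustrative names, not part of the patch, and the constant values are assumed to mirror <linux/if_packet.h>, so check them against your kernel headers. The sketch uses an unsigned type so that shifting the flags into bit 31 is well defined.

```c
#include <stdint.h>

/* Assumed to mirror the values in <linux/if_packet.h>; verify locally. */
#define SKETCH_FANOUT_HASH          0x0000u
#define SKETCH_FANOUT_FLAG_ROLLOVER 0x1000u
#define SKETCH_FANOUT_FLAG_DEFRAG   0x8000u

/*
 * Pack a fanout argument the way the patch does:
 * low 16 bits  = group ID derived from pid XOR interface index,
 * high 16 bits = fanout type OR'ed with flags.
 */
static uint32_t
make_fanout_arg(uint32_t pid, uint32_t if_index)
{
	uint32_t group_id = (pid ^ if_index) & 0xffffu;
	uint32_t type_flags = SKETCH_FANOUT_HASH |
	                      SKETCH_FANOUT_FLAG_DEFRAG |
	                      SKETCH_FANOUT_FLAG_ROLLOVER;

	return group_id | (type_flags << 16);
}
```

On a real socket the result would be handed to setsockopt() with SOL_PACKET and PACKET_FANOUT, as the patch does for each queue socket. Because every queue socket of one port passes the same value, they all join the same fanout group.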
  
Zhou, Danny July 15, 2014, 3:40 p.m. UTC | #5
> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Tuesday, July 15, 2014 10:01 PM
> To: Neil Horman
> Cc: Zhou, Danny; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> > On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > > According to my performance measurement results for 64B small
> > > packet, 1 queue perf. is better than 16 queues (1.35M pps vs. 0.93M
> > > pps) which make sense to me as for 16 queues case more CPU cycles
> > > (16 queues' 87% vs. 1 queue' 80%) in kernel land needed for
> > > NAPI-enabled ixgbe driver to switch between polling and interrupt
> > > modes in order to service per-queue rx interrupts, so more context
> > > switch overhead involved. Also, since the
> > > eth_packet_rx/eth_packet_tx routines involves in two memory copies
> > > between DPDK mbuf and pbuf for each packet, it can hardly achieve
> > > high performance unless packet are directly DMA to mbuf which needs ixgbe
> > > > driver to support.
> >
> > I thought 16 queues would be spread out between as many cpus as you
> > had though, obviating the need for context switches, no?
> 
> I think Danny is testing the single CPU case.  Having more queues than CPUs
> probably does not provide any benefit.
> 
> It would be cool to hack the DPDK memory management to work directly out of the
> mmap'ed AF_PACKET buffers.  But at this point I don't have enough knowledge of
> DPDK internals to know if that is at all reasonable...
> 
> John
> 
> P.S.  Danny, have you run any performance tests on the PCAP driver?

No, I do not have PCAP driver performance results in hand. But I remember it is less than
1M pps for 64B.

> 
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.
  
John W. Linville July 15, 2014, 7:08 p.m. UTC | #6
On Tue, Jul 15, 2014 at 03:40:56PM +0000, Zhou, Danny wrote:
> 
> > -----Original Message-----
> > From: John W. Linville [mailto:linville@tuxdriver.com]
> > Sent: Tuesday, July 15, 2014 10:01 PM
> > To: Neil Horman
> > Cc: Zhou, Danny; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for
> > AF_PACKET-based virtual devices
> > 
> > On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> > > On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > > > According to my performance measurement results for 64B small
> > > > packet, 1 queue perf. is better than 16 queues (1.35M pps vs. 0.93M
> > > > pps) which make sense to me as for 16 queues case more CPU cycles
> > > > (16 queues' 87% vs. 1 queue' 80%) in kernel land needed for
> > > > NAPI-enabled ixgbe driver to switch between polling and interrupt
> > > > modes in order to service per-queue rx interrupts, so more context
> > > > switch overhead involved. Also, since the
> > > > eth_packet_rx/eth_packet_tx routines involves in two memory copies
> > > > between DPDK mbuf and pbuf for each packet, it can hardly achieve
> > > > high performance unless packet are directly DMA to mbuf which needs ixgbe
> > > > > driver to support.
> > >
> > > I thought 16 queues would be spread out between as many cpus as you
> > > had though, obviating the need for context switches, no?
> > 
> > I think Danny is testing the single CPU case.  Having more queues than CPUs
> > probably does not provide any benefit.
> > 
> > It would be cool to hack the DPDK memory management to work directly out of the
> > mmap'ed AF_PACKET buffers.  But at this point I don't have enough knowledge of
> > DPDK internals to know if that is at all reasonable...
> > 
> > John
> > 
> > P.S.  Danny, have you run any performance tests on the PCAP driver?
> 
> No, I do not have PCAP driver performance results in hand. But I remember it is less than
> 1M pps for 64B.

Cool, good info...thanks!
  
Neil Horman July 15, 2014, 8:31 p.m. UTC | #7
On Tue, Jul 15, 2014 at 10:01:11AM -0400, John W. Linville wrote:
> On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> > On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > > According to my performance measurement results for 64B small
> > > packet, 1 queue perf. is better than 16 queues (1.35M pps vs. 0.93M
> > > pps) which make sense to me as for 16 queues case more CPU cycles (16
> > > queues' 87% vs. 1 queue' 80%) in kernel land needed for NAPI-enabled
> > > ixgbe driver to switch between polling and interrupt modes in order
> > > to service per-queue rx interrupts, so more context switch overhead
> > > involved. Also, since the eth_packet_rx/eth_packet_tx routines involves
> > > in two memory copies between DPDK mbuf and pbuf for each packet,
> > > it can hardly achieve high performance unless packet are directly
> > > DMA to mbuf which needs ixgbe driver to support.
> > 
> > I thought 16 queues would be spread out between as many cpus as you had though,
> > obviating the need for context switches, no?
> 
> I think Danny is testing the single CPU case.  Having more queues
> than CPUs probably does not provide any benefit.
> 
Ah, yes, generally speaking, you never want nr_cpus < nr_queues.  Otherwise
you'll just be fighting yourself.

> It would be cool to hack the DPDK memory management to work directly
> out of the mmap'ed AF_PACKET buffers.  But at this point I don't
> have enough knowledge of DPDK internals to know if that is at all
> reasonable...
> 
> John
> 
> P.S.  Danny, have you run any performance tests on the PCAP driver?
> 
> -- 
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.
>
  
Zhou, Danny July 15, 2014, 8:41 p.m. UTC | #8
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, July 16, 2014 4:31 AM
> To: John W. Linville
> Cc: Zhou, Danny; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for
> AF_PACKET-based virtual devices
> 
> On Tue, Jul 15, 2014 at 10:01:11AM -0400, John W. Linville wrote:
> > On Tue, Jul 15, 2014 at 08:17:44AM -0400, Neil Horman wrote:
> > > On Tue, Jul 15, 2014 at 12:15:49AM +0000, Zhou, Danny wrote:
> > > > According to my performance measurement results for 64B small
> > > > > packet, 1 queue perf. is better than 16 queues (1.35M pps vs. 0.93M
> > > > pps) which make sense to me as for 16 queues case more CPU cycles
> > > > (16 queues' 87% vs. 1 queue' 80%) in kernel land needed for
> > > > NAPI-enabled ixgbe driver to switch between polling and interrupt
> > > > modes in order to service per-queue rx interrupts, so more context
> > > > switch overhead involved. Also, since the
> > > > eth_packet_rx/eth_packet_tx routines involves in two memory copies
> > > > between DPDK mbuf and pbuf for each packet, it can hardly achieve
> > > > high performance unless packet are directly DMA to mbuf which needs ixgbe
> > > > > driver to support.
> > >
> > > I thought 16 queues would be spread out between as many cpus as you
> > > had though, obviating the need for context switches, no?
> >
> > I think Danny is testing the single CPU case.  Having more queues than
> > CPUs probably does not provide any benefit.
> >
> Ah, yes, generally speaking, you never want nr_cpus < nr_queues.  Otherwise you'll
> just be fighting yourself.
> 

That is true for an interrupt-based NIC driver, and for this AF_PACKET-based PMD because it
depends on the kernel NIC driver. But with a native DPDK poll-mode NIC driver, you can pin a
thread to a core and have it poll multiple queues on one NIC, or queues on different NICs, at the
cost of higher power consumption and CPU cycles wasted busy-waiting for packets.
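
To make the multi-queue polling case concrete, here is a minimal sketch of one core making a single non-blocking pass over several queues. The names rx_burst_fn, poll_queues_once() and fake_rx() are hypothetical and exist only for illustration; in a real DPDK application the callback would be replaced by a call such as rte_eth_rx_burst(), and the pass would sit inside the lcore's main loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-queue receive call; returns the
 * number of packets fetched, at most max_pkts. */
typedef uint16_t (*rx_burst_fn)(uint16_t queue_id, uint16_t max_pkts,
                                void *ctx);

/*
 * One polling pass by a single core over nb_queues queues.
 * It never blocks: an empty queue costs one call and returns 0,
 * which is where the "wasted cycles" of busy polling come from.
 */
static uint64_t
poll_queues_once(rx_burst_fn rx, void *ctx, uint16_t nb_queues,
                 uint16_t burst)
{
	uint64_t total = 0;
	uint16_t q;

	for (q = 0; q < nb_queues; q++)
		total += rx(q, burst, ctx);

	return total;
}

/* Toy queue model for trying the loop: ctx points at an array where
 * backlog[q] is the number of packets waiting on queue q. */
static uint16_t
fake_rx(uint16_t queue_id, uint16_t max_pkts, void *ctx)
{
	uint16_t *backlog = ctx;
	uint16_t n = backlog[queue_id] < max_pkts ?
	             backlog[queue_id] : max_pkts;

	backlog[queue_id] -= n;
	return n;
}
```

With backlogs of 5, 0 and 40 packets and a burst size of 32, the first pass collects 37 packets and the second collects the remaining 8. The loop keeps spinning even when every queue is empty, which is the power and CPU cost mentioned above.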

> > It would be cool to hack the DPDK memory management to work directly
> > out of the mmap'ed AF_PACKET buffers.  But at this point I don't have
> > enough knowledge of DPDK internals to know if that is at all
> > reasonable...
> >
> > John
> >
> > P.S.  Danny, have you run any performance tests on the PCAP driver?
> >
> > --
> > John W. Linville		Someday the world will need a hero, and you
> > linville@tuxdriver.com			might be all we have.  Be ready.
> >
  
John W. Linville Sept. 12, 2014, 6:05 p.m. UTC | #9
Ping?  Are there objections to this patch from mid-July?

John

On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> socket.  This implementation uses mmap'ed ring buffers to limit copying
> and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> AF_PACKET is used for frame reception.  In the current implementation,
> Tx and Rx queues are always paired, and therefore are always equal
> in number -- changing this would be a Simple Matter Of Programming.
> 
> Interfaces of this type are created with a command line option like
> "--vdev=eth_packet0,iface=...".  There are a number of options available
> as arguments:
> 
>  - Interface is chosen by "iface" (required)
>  - Number of queue pairs set by "qpairs" (optional, default: 1)
>  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
>  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
>  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
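
As a worked example of how the three MMAP options above relate, the patch derives the block count as framecount / (blocksize / framesize) and rejects a frame size larger than the block size. A minimal sketch of that arithmetic follows; ring_block_count() is a hypothetical helper name, not a function in the patch, and returning 0 for unusable parameters is a simplification of the patch's separate error messages.

```c
#include <stdint.h>

/*
 * Derive the AF_PACKET ring block count from the user-visible options.
 * Returns 0 when the combination cannot describe a valid ring.
 */
static unsigned int
ring_block_count(unsigned int blocksize, unsigned int framesize,
                 unsigned int framecount)
{
	unsigned int frames_per_block;

	/* A frame must fit inside one block, and sizes must be non-zero. */
	if (blocksize == 0 || framesize == 0 || framesize > blocksize)
		return 0;

	frames_per_block = blocksize / framesize;

	/* Integer division: too few frames to fill one block yields 0. */
	return framecount / frames_per_block;
}
```

With the defaults (blocksz 4096, framesz 2048, framecnt 512) each block holds two frames, so the ring is made of 256 blocks.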
> 
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> This PMD is intended to provide a means for using DPDK on a broad
> range of hardware without hardware-specific PMDs and (hopefully)
> with better performance than what PCAP offers in Linux.  This might
> be useful as a development platform for DPDK applications when
> DPDK-supported hardware is expensive or unavailable.
> 
> New in v2:
> 
> -- fixup some style issues found by check patch
> -- use if_index as part of fanout group ID
> -- set default number of queue pairs to 1
> 
>  config/common_bsdapp                   |   5 +
>  config/common_linuxapp                 |   5 +
>  lib/Makefile                           |   1 +
>  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
>  lib/librte_pmd_packet/Makefile         |  60 +++
>  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
>  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
>  mk/rte.app.mk                          |   4 +
>  8 files changed, 957 insertions(+)
>  create mode 100644 lib/librte_pmd_packet/Makefile
>  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
>  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> 
> diff --git a/config/common_bsdapp b/config/common_bsdapp
> index 943dce8f1ede..c317f031278e 100644
> --- a/config/common_bsdapp
> +++ b/config/common_bsdapp
> @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
>  
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> +
> +#
>  # Do prefetch of packet data within PMD driver receive function
>  #
>  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> diff --git a/config/common_linuxapp b/config/common_linuxapp
> index 7bf5d80d4e26..f9e7bc3015ec 100644
> --- a/config/common_linuxapp
> +++ b/config/common_linuxapp
> @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
>  CONFIG_RTE_LIBRTE_PMD_BOND=y
>  
>  #
> +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> +#
> +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> +
> +#
>  # Compile Xen PMD
>  #
>  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> diff --git a/lib/Makefile b/lib/Makefile
> index 10c5bb3045bc..930fadf29898 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
>  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
>  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> index 756d6b0c9301..feed24a63272 100644
> --- a/lib/librte_eal/linuxapp/eal/Makefile
> +++ b/lib/librte_eal/linuxapp/eal/Makefile
> @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
>  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
>  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
>  CFLAGS += $(WERROR_FLAGS) -O3
>  
> diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> new file mode 100644
> index 000000000000..e1266fb992cd
> --- /dev/null
> +++ b/lib/librte_pmd_packet/Makefile
> @@ -0,0 +1,60 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> +#   Copyright(c) 2014 6WIND S.A.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +#
> +# library name
> +#
> +LIB = librte_pmd_packet.a
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +
> +#
> +# all source are stored in SRCS-y
> +#
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> +
> +#
> +# Export include files
> +#
> +SYMLINK-y-include += rte_eth_packet.h
> +
> +# this lib depends upon:
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> new file mode 100644
> index 000000000000..9c82d16e730f
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> @@ -0,0 +1,826 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> + *
> + *   Originally based upon librte_pmd_pcap code:
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2014 6WIND S.A.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_mbuf.h>
> +#include <rte_ethdev.h>
> +#include <rte_malloc.h>
> +#include <rte_kvargs.h>
> +#include <rte_dev.h>
> +
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <arpa/inet.h>
> +#include <net/if.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include <poll.h>
> +
> +#include "rte_eth_packet.h"
> +
> +#define ETH_PACKET_IFACE_ARG		"iface"
> +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> +
> +#define DFLT_BLOCK_SIZE		(1 << 12)
> +#define DFLT_FRAME_SIZE		(1 << 11)
> +#define DFLT_FRAME_COUNT	(1 << 9)
> +
> +struct pkt_rx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	struct rte_mempool *mb_pool;
> +
> +	volatile unsigned long rx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pkt_tx_queue {
> +	int sockfd;
> +
> +	struct iovec *rd;
> +	uint8_t *map;
> +	unsigned int framecount;
> +	unsigned int framenum;
> +
> +	volatile unsigned long tx_pkts;
> +	volatile unsigned long err_pkts;
> +};
> +
> +struct pmd_internals {
> +	unsigned nb_queues;
> +
> +	int if_index;
> +	struct ether_addr eth_addr;
> +
> +	struct tpacket_req req;
> +
> +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> +};
> +
> +static const char *valid_arguments[] = {
> +	ETH_PACKET_IFACE_ARG,
> +	ETH_PACKET_NUM_Q_ARG,
> +	ETH_PACKET_BLOCKSIZE_ARG,
> +	ETH_PACKET_FRAMESIZE_ARG,
> +	ETH_PACKET_FRAMECOUNT_ARG,
> +	NULL
> +};
> +
> +static const char *drivername = "AF_PACKET PMD";
> +
> +static struct rte_eth_link pmd_link = {
> +	.link_speed = 10000,
> +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> +	.link_status = 0
> +};
> +
> +static uint16_t
> +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	unsigned i;
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	struct pkt_rx_queue *pkt_q = queue;
> +	uint16_t num_rx = 0;
> +	unsigned int framecount, framenum;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	/*
> +	 * Reads the given number of packets from the AF_PACKET socket one by
> +	 * one and copies the packet data into a newly allocated mbuf.
> +	 */
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> +			break;
> +
> +		/* allocate the next mbuf */
> +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> +		if (unlikely(mbuf == NULL))
> +			break;
> +
> +		/* packet will fit in the mbuf, go ahead and receive it */
> +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_KERNEL;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +
> +		/* account for the receive frame */
> +		bufs[i] = mbuf;
> +		num_rx++;
> +	}
> +	pkt_q->framenum = framenum;
> +	pkt_q->rx_pkts += num_rx;
> +	return num_rx;
> +}
> +
> +/*
> + * Callback to handle sending packets through a real NIC.
> + */
> +static uint16_t
> +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	struct tpacket2_hdr *ppd;
> +	struct rte_mbuf *mbuf;
> +	uint8_t *pbuf;
> +	unsigned int framecount, framenum;
> +	struct pollfd pfd;
> +	struct pkt_tx_queue *pkt_q = queue;
> +	uint16_t num_tx = 0;
> +	int i;
> +
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	memset(&pfd, 0, sizeof(pfd));
> +	pfd.fd = pkt_q->sockfd;
> +	pfd.events = POLLOUT;
> +	pfd.revents = 0;
> +
> +	framecount = pkt_q->framecount;
> +	framenum = pkt_q->framenum;
> +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +	for (i = 0; i < nb_pkts; i++) {
> +		/* point at the next incoming frame */
> +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> +		    (poll(&pfd, 1, -1) < 0))
> +				continue;
> +
> +		/* copy the tx frame data */
> +		mbuf = bufs[num_tx];
> +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> +			sizeof(struct sockaddr_ll);
> +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> +
> +		/* release incoming frame and advance ring buffer */
> +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> +		if (++framenum >= framecount)
> +			framenum = 0;
> +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> +
> +		num_tx++;
> +		rte_pktmbuf_free(mbuf);
> +	}
> +
> +	/* kick-off transmits */
> +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +
> +	pkt_q->framenum = framenum;
> +	pkt_q->tx_pkts += num_tx;
> +	pkt_q->err_pkts += nb_pkts - num_tx;
> +	return num_tx;
> +}
> +
> +static int
> +eth_dev_start(struct rte_eth_dev *dev)
> +{
> +	dev->data->dev_link.link_status = 1;
> +	return 0;
> +}
> +
> +/*
> + * This function gets called when the current port gets stopped.
> + */
> +static void
> +eth_dev_stop(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	int sockfd;
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	for (i = 0; i < internals->nb_queues; i++) {
> +		sockfd = internals->rx_queue[i].sockfd;
> +		if (sockfd != -1)
> +			close(sockfd);
> +		sockfd = internals->tx_queue[i].sockfd;
> +		if (sockfd != -1)
> +			close(sockfd);
> +	}
> +
> +	dev->data->dev_link.link_status = 0;
> +}
> +
> +static int
> +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> +{
> +	return 0;
> +}
> +
> +static void
> +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> +{
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev_info->driver_name = drivername;
> +	dev_info->if_index = internals->if_index;
> +	dev_info->max_mac_addrs = 1;
> +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> +	dev_info->min_rx_bufsize = 0;
> +	dev_info->pci_dev = NULL;
> +}
> +
> +static void
> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> +{
> +	unsigned i, imax;
> +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> +	const struct pmd_internals *internal = dev->data->dev_private;
> +
> +	memset(igb_stats, 0, sizeof(*igb_stats));
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> +		rx_total += igb_stats->q_ipackets[i];
> +	}
> +
> +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> +	for (i = 0; i < imax; i++) {
> +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> +		tx_total += igb_stats->q_opackets[i];
> +		tx_err_total += igb_stats->q_errors[i];
> +	}
> +
> +	igb_stats->ipackets = rx_total;
> +	igb_stats->opackets = tx_total;
> +	igb_stats->oerrors = tx_err_total;
> +}
> +
> +static void
> +eth_stats_reset(struct rte_eth_dev *dev)
> +{
> +	unsigned i;
> +	struct pmd_internals *internal = dev->data->dev_private;
> +
> +	for (i = 0; i < internal->nb_queues; i++)
> +		internal->rx_queue[i].rx_pkts = 0;
> +
> +	for (i = 0; i < internal->nb_queues; i++) {
> +		internal->tx_queue[i].tx_pkts = 0;
> +		internal->tx_queue[i].err_pkts = 0;
> +	}
> +}
> +
> +static void
> +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> +{
> +}
> +
> +static void
> +eth_queue_release(void *q __rte_unused)
> +{
> +}
> +
> +static int
> +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> +                int wait_to_complete __rte_unused)
> +{
> +	return 0;
> +}
> +
> +static int
> +eth_rx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t rx_queue_id,
> +                   uint16_t nb_rx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> +                   struct rte_mempool *mb_pool)
> +{
> +	struct pmd_internals *internals = dev->data->dev_private;
> +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> +	struct rte_pktmbuf_pool_private *mbp_priv;
> +	uint16_t buf_size;
> +
> +	pkt_q->mb_pool = mb_pool;
> +
> +	/* Now get the space available for data in the mbuf */
> +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> +	                       RTE_PKTMBUF_HEADROOM);
> +
> +	if (ETH_FRAME_LEN > buf_size) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> +			dev->data->name, ETH_FRAME_LEN, buf_size);
> +		return -ENOMEM;
> +	}
> +
> +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> +
> +	return 0;
> +}
> +
> +static int
> +eth_tx_queue_setup(struct rte_eth_dev *dev,
> +                   uint16_t tx_queue_id,
> +                   uint16_t nb_tx_desc __rte_unused,
> +                   unsigned int socket_id __rte_unused,
> +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> +{
> +
> +	struct pmd_internals *internals = dev->data->dev_private;
> +
> +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> +	return 0;
> +}
> +
> +static struct eth_dev_ops ops = {
> +	.dev_start = eth_dev_start,
> +	.dev_stop = eth_dev_stop,
> +	.dev_close = eth_dev_close,
> +	.dev_configure = eth_dev_configure,
> +	.dev_infos_get = eth_dev_info,
> +	.rx_queue_setup = eth_rx_queue_setup,
> +	.tx_queue_setup = eth_tx_queue_setup,
> +	.rx_queue_release = eth_queue_release,
> +	.tx_queue_release = eth_queue_release,
> +	.link_update = eth_link_update,
> +	.stats_get = eth_stats_get,
> +	.stats_reset = eth_stats_reset,
> +};
> +
> +/*
> + * Opens an AF_PACKET socket
> + */
> +static int
> +open_packet_iface(const char *key __rte_unused,
> +                  const char *value __rte_unused,
> +                  void *extra_args)
> +{
> +	int *sockfd = extra_args;
> +
> +	/* Open an AF_PACKET socket... */
> +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +	if (*sockfd == -1) {
> +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +rte_pmd_init_internals(const char *name,
> +                       const int sockfd,
> +                       const unsigned nb_queues,
> +                       unsigned int blocksize,
> +                       unsigned int blockcnt,
> +                       unsigned int framesize,
> +                       unsigned int framecnt,
> +                       const unsigned numa_node,
> +                       struct pmd_internals **internals,
> +                       struct rte_eth_dev **eth_dev,
> +                       struct rte_kvargs *kvlist)
> +{
> +	struct rte_eth_dev_data *data = NULL;
> +	struct rte_pci_device *pci_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	struct ifreq ifr;
> +	size_t ifnamelen;
> +	unsigned k_idx;
> +	struct sockaddr_ll sockaddr;
> +	struct tpacket_req *req;
> +	struct pkt_rx_queue *rx_queue;
> +	struct pkt_tx_queue *tx_queue;
> +	int rc, tpver, discard, bypass;
> +	unsigned int i, q, rdsize;
> +	int qsockfd, fanout_arg;
> +
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> +			break;
> +	}
> +	if (pair == NULL) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: no interface specified for AF_PACKET ethdev\n",
> +		        name);
> +		goto error;
> +	}
> +
> +	RTE_LOG(INFO, PMD,
> +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> +		name, numa_node);
> +
> +	/*
> +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> +	 * and internal (private) data
> +	 */
> +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> +	if (data == NULL)
> +		goto error;
> +
> +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> +	if (pci_dev == NULL)
> +		goto error;
> +
> +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> +	                                0, numa_node);
> +	if (*internals == NULL)
> +		goto error;
> +
> +	req = &((*internals)->req);
> +
> +	req->tp_block_size = blocksize;
> +	req->tp_block_nr = blockcnt;
> +	req->tp_frame_size = framesize;
> +	req->tp_frame_nr = framecnt;
> +
> +	ifnamelen = strlen(pair->value);
> +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> +		ifr.ifr_name[ifnamelen] = '\0';
> +	} else {
> +		RTE_LOG(ERR, PMD,
> +			"%s: I/F name too long (%s)\n",
> +			name, pair->value);
> +		goto error;
> +	}
> +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> +		        name);
> +		goto error;
> +	}
> +	(*internals)->if_index = ifr.ifr_ifindex;
> +
> +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> +		        name);
> +		goto error;
> +	}
> +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> +
> +	memset(&sockaddr, 0, sizeof(sockaddr));
> +	sockaddr.sll_family = AF_PACKET;
> +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> +	sockaddr.sll_ifindex = (*internals)->if_index;
> +
> +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> +
> +	for (q = 0; q < nb_queues; q++) {
> +		/* Open an AF_PACKET socket for this queue... */
> +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> +		if (qsockfd == -1) {
> +			RTE_LOG(ERR, PMD,
> +			        "%s: could not open AF_PACKET socket\n",
> +			        name);
> +			return -1;
> +		}
> +
> +		tpver = TPACKET_V2;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> +				&tpver, sizeof(tpver));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_VERSION on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		discard = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> +				&discard, sizeof(discard));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_LOSS on "
> +			        "AF_PACKET socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		bypass = 1;
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> +				&bypass, sizeof(bypass));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_QDISC_BYPASS "
> +			        "on AF_PACKET socket for %s\n", name,
> +			        pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> +				"socket for %s\n", name, pair->value);
> +			goto error;
> +		}
> +
> +		rx_queue = &((*internals)->rx_queue[q]);
> +		rx_queue->framecount = req->tp_frame_nr;
> +
> +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> +				    qsockfd, 0);
> +		if (rx_queue->map == MAP_FAILED) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> +				name, pair->value);
> +			goto error;
> +		}
> +
> +		/* rdsize is same for both Tx and Rx */
> +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> +
> +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		rx_queue->sockfd = qsockfd;
> +
> +		tx_queue = &((*internals)->tx_queue[q]);
> +		tx_queue->framecount = req->tp_frame_nr;
> +
> +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> +
> +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> +		for (i = 0; i < req->tp_frame_nr; ++i) {
> +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> +		}
> +		tx_queue->sockfd = qsockfd;
> +
> +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not bind AF_PACKET socket to %s\n",
> +			        name, pair->value);
> +			goto error;
> +		}
> +
> +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> +				&fanout_arg, sizeof(fanout_arg));
> +		if (rc == -1) {
> +			RTE_LOG(ERR, PMD,
> +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> +				"for %s\n", name, pair->value);
> +			goto error;
> +		}
> +	}
> +
> +	/* reserve an ethdev entry */
> +	*eth_dev = rte_eth_dev_allocate(name);
> +	if (*eth_dev == NULL)
> +		goto error;
> +
> +	/*
> +	 * now put it all together
> +	 * - store queue data in internals,
> +	 * - store numa_node info in pci_driver
> +	 * - point eth_dev_data to internals and pci_driver
> +	 * - and point eth_dev structure to new eth_dev_data structure
> +	 */
> +
> +	(*internals)->nb_queues = nb_queues;
> +
> +	data->dev_private = *internals;
> +	data->port_id = (*eth_dev)->data->port_id;
> +	data->nb_rx_queues = (uint16_t)nb_queues;
> +	data->nb_tx_queues = (uint16_t)nb_queues;
> +	data->dev_link = pmd_link;
> +	data->mac_addrs = &(*internals)->eth_addr;
> +
> +	pci_dev->numa_node = numa_node;
> +
> +	(*eth_dev)->data = data;
> +	(*eth_dev)->dev_ops = &ops;
> +	(*eth_dev)->pci_dev = pci_dev;
> +
> +	return 0;
> +
> +error:
> +	if (data)
> +		rte_free(data);
> +	if (pci_dev)
> +		rte_free(pci_dev);
> +	for (q = 0; q < nb_queues; q++) {
> +		if ((*internals)->rx_queue[q].rd)
> +			rte_free((*internals)->rx_queue[q].rd);
> +		if ((*internals)->tx_queue[q].rd)
> +			rte_free((*internals)->tx_queue[q].rd);
> +	}
> +	if (*internals)
> +		rte_free(*internals);
> +	return -1;
> +}
> +
> +static int
> +rte_eth_from_packet(const char *name,
> +                    int const *sockfd,
> +                    const unsigned numa_node,
> +                    struct rte_kvargs *kvlist)
> +{
> +	struct pmd_internals *internals = NULL;
> +	struct rte_eth_dev *eth_dev = NULL;
> +	struct rte_kvargs_pair *pair = NULL;
> +	unsigned k_idx;
> +	unsigned int blockcount;
> +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> +	unsigned int framesize = DFLT_FRAME_SIZE;
> +	unsigned int framecount = DFLT_FRAME_COUNT;
> +	unsigned int qpairs = 1;
> +
> +	/* do some parameter checking */
> +	if (*sockfd < 0)
> +		return -1;
> +
> +	/*
> +	 * Walk arguments for configurable settings
> +	 */
> +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> +		pair = &kvlist->pairs[k_idx];
> +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> +			qpairs = atoi(pair->value);
> +			if (qpairs < 1 ||
> +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid qpairs value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> +			blocksize = atoi(pair->value);
> +			if (!blocksize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid blocksize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> +			framesize = atoi(pair->value);
> +			if (!framesize) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framesize value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> +			framecount = atoi(pair->value);
> +			if (!framecount) {
> +				RTE_LOG(ERR, PMD,
> +					"%s: invalid framecount value\n",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}
> +	}
> +
> +	if (framesize > blocksize) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> +		        name);
> +		return -1;
> +	}
> +
> +	blockcount = framecount / (blocksize / framesize);
> +	if (!blockcount) {
> +		RTE_LOG(ERR, PMD,
> +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> +		return -1;
> +	}
> +
> +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> +
> +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> +	                           blocksize, blockcount,
> +	                           framesize, framecount,
> +	                           numa_node, &internals, &eth_dev,
> +	                           kvlist) < 0)
> +		return -1;
> +
> +	eth_dev->rx_pkt_burst = eth_packet_rx;
> +	eth_dev->tx_pkt_burst = eth_packet_tx;
> +
> +	return 0;
> +}
> +
> +int
> +rte_pmd_packet_devinit(const char *name, const char *params)
> +{
> +	unsigned numa_node;
> +	int ret;
> +	struct rte_kvargs *kvlist;
> +	int sockfd = -1;
> +
> +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> +
> +	numa_node = rte_socket_id();
> +
> +	kvlist = rte_kvargs_parse(params, valid_arguments);
> +	if (kvlist == NULL)
> +		return -1;
> +
> +	/*
> +	 * If iface argument is passed we open the NICs and use them for
> +	 * reading / writing
> +	 */
> +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> +
> +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> +		                         &open_packet_iface, &sockfd);
> +		if (ret < 0)
> +			return -1;
> +	}
> +
> +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> +	close(sockfd); /* no longer needed */
> +
> +	if (ret < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static struct rte_driver pmd_packet_drv = {
> +	.name = "eth_packet",
> +	.type = PMD_VDEV,
> +	.init = rte_pmd_packet_devinit,
> +};
> +
> +PMD_REGISTER_DRIVER(pmd_packet_drv);
> diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> new file mode 100644
> index 000000000000..f685611da3e9
> --- /dev/null
> +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> @@ -0,0 +1,55 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_ETH_PACKET_H_
> +#define _RTE_ETH_PACKET_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> +
> +#define RTE_PMD_PACKET_MAX_RINGS 16
> +
> +/**
> + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> + * configured on command line.
> + */
> +int rte_pmd_packet_devinit(const char *name, const char *params);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 34dff2a02a05..a6994c4dbe93 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
>  LDLIBS += -lrte_pmd_pcap -lpcap
>  endif
>  
> +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> +LDLIBS += -lrte_pmd_packet
> +endif
> +
>  endif # plugins
>  
>  LDLIBS += $(EXECENV_LDLIBS)
> -- 
> 1.9.3
> 
>
  
Zhou, Danny Sept. 12, 2014, 6:31 p.m. UTC | #10
I am concerned about the performance cost of the many memcpy() calls this design requires. Specifically, on the Rx side, the kernel NIC driver needs to copy packets into an skb, then af_packet copies them into the AF_PACKET buffers that are mapped to user space, and then they are copied again into DPDK mbufs. Three more copies are needed on the Tx side. So to run a simple DPDK L2/L3 forwarding benchmark, each packet needs 6 copies, which has a significant negative impact on performance. We have built a bifurcated driver prototype that can do zero-copy and achieve native DPDK performance, but it depends on changes to the base driver and to the AF_PACKET code in the kernel. John R will be presenting it at the upcoming Linux Plumbers Conference. Once the kernel adopts it, the relevant PMD will be submitted to dpdk.org.
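
As a rough illustration of the copy count above, the sketch below tallies the copies per forwarded packet and the bytes moved for a given frame length. The enum and function names are hypothetical and exist only for this example, and it assumes every copy moves the whole frame; it is back-of-the-envelope arithmetic, not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rx copy stages as counted in the message above.
 * The names are hypothetical and exist only for this illustration.
 */
enum rx_copy_stage {
	RX_NIC_DRIVER_TO_SKB,	/* kernel NIC driver copies the packet into an skb */
	RX_SKB_TO_RING,		/* af_packet copies it into the mmap'ed ring */
	RX_RING_TO_MBUF,	/* the PMD copies it into a DPDK mbuf */
	RX_COPY_STAGES		/* number of Rx stages: 3 */
};

/* The message gives only the Tx count (3), not the individual stages. */
#define TX_COPY_STAGES 3

/* Copies made for one packet that is received and then forwarded. */
unsigned int
copies_per_forwarded_packet(void)
{
	return RX_COPY_STAGES + TX_COPY_STAGES;
}

/* Bytes moved for one forwarded frame, assuming each copy moves it whole. */
size_t
bytes_copied_per_forwarded_packet(size_t frame_len)
{
	assert(frame_len > 0);
	return frame_len * copies_per_forwarded_packet();
}
```

Under these assumptions, forwarding one 64-byte frame costs 6 copies and 384 bytes of copying, where a zero-copy path would avoid that data movement.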

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> Sent: Saturday, September 13, 2014 2:05 AM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> 
> Ping?  Are there objections to this patch from mid-July?
> 
> John
> 
> On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > AF_PACKET is used for frame reception.  In the current implementation,
> > Tx and Rx queues are always paired, and therefore are always equal
> > in number -- changing this would be a Simple Matter Of Programming.
> >
> > Interfaces of this type are created with a command line option like
> > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > as arguments:
> >
> >  - Interface is chosen by "iface" (required)
> >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> >
> > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > ---
> > This PMD is intended to provide a means for using DPDK on a broad
> > range of hardware without hardware-specific PMDs and (hopefully)
> > with better performance than what PCAP offers in Linux.  This might
> > be useful as a development platform for DPDK applications when
> > DPDK-supported hardware is expensive or unavailable.
> >
> > New in v2:
> >
> > -- fixup some style issues found by checkpatch
> > -- use if_index as part of fanout group ID
> > -- set default number of queue pairs to 1
> >
> >  config/common_bsdapp                   |   5 +
> >  config/common_linuxapp                 |   5 +
> >  lib/Makefile                           |   1 +
> >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> >  lib/librte_pmd_packet/Makefile         |  60 +++
> >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> >  mk/rte.app.mk                          |   4 +
> >  8 files changed, 957 insertions(+)
> >  create mode 100644 lib/librte_pmd_packet/Makefile
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> >
> > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > index 943dce8f1ede..c317f031278e 100644
> > --- a/config/common_bsdapp
> > +++ b/config/common_bsdapp
> > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> >
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > +
> > +#
> >  # Do prefetch of packet data within PMD driver receive function
> >  #
> >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > --- a/config/common_linuxapp
> > +++ b/config/common_linuxapp
> > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> >
> >  #
> > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > +#
> > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > +
> > +#
> >  # Compile Xen PMD
> >  #
> >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 10c5bb3045bc..930fadf29898 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > index 756d6b0c9301..feed24a63272 100644
> > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> >  CFLAGS += $(WERROR_FLAGS) -O3
> >
> > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > new file mode 100644
> > index 000000000000..e1266fb992cd
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/Makefile
> > @@ -0,0 +1,60 @@
> > +#   BSD LICENSE
> > +#
> > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > +#   Copyright(c) 2014 6WIND S.A.
> > +#   All rights reserved.
> > +#
> > +#   Redistribution and use in source and binary forms, with or without
> > +#   modification, are permitted provided that the following conditions
> > +#   are met:
> > +#
> > +#     * Redistributions of source code must retain the above copyright
> > +#       notice, this list of conditions and the following disclaimer.
> > +#     * Redistributions in binary form must reproduce the above copyright
> > +#       notice, this list of conditions and the following disclaimer in
> > +#       the documentation and/or other materials provided with the
> > +#       distribution.
> > +#     * Neither the name of Intel Corporation nor the names of its
> > +#       contributors may be used to endorse or promote products derived
> > +#       from this software without specific prior written permission.
> > +#
> > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +#
> > +# library name
> > +#
> > +LIB = librte_pmd_packet.a
> > +
> > +CFLAGS += -O3
> > +CFLAGS += $(WERROR_FLAGS)
> > +
> > +#
> > +# all source are stored in SRCS-y
> > +#
> > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > +
> > +#
> > +# Export include files
> > +#
> > +SYMLINK-y-include += rte_eth_packet.h
> > +
> > +# this lib depends upon:
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > new file mode 100644
> > index 000000000000..9c82d16e730f
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > @@ -0,0 +1,826 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > + *
> > + *   Originally based upon librte_pmd_pcap code:
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   Copyright(c) 2014 6WIND S.A.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#include <rte_mbuf.h>
> > +#include <rte_ethdev.h>
> > +#include <rte_malloc.h>
> > +#include <rte_kvargs.h>
> > +#include <rte_dev.h>
> > +
> > +#include <linux/if_ether.h>
> > +#include <linux/if_packet.h>
> > +#include <arpa/inet.h>
> > +#include <net/if.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +#include <poll.h>
> > +
> > +#include "rte_eth_packet.h"
> > +
> > +#define ETH_PACKET_IFACE_ARG		"iface"
> > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > +
> > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > +#define DFLT_FRAME_SIZE		(1 << 11)
> > +#define DFLT_FRAME_COUNT	(1 << 9)
> > +
> > +struct pkt_rx_queue {
> > +	int sockfd;
> > +
> > +	struct iovec *rd;
> > +	uint8_t *map;
> > +	unsigned int framecount;
> > +	unsigned int framenum;
> > +
> > +	struct rte_mempool *mb_pool;
> > +
> > +	volatile unsigned long rx_pkts;
> > +	volatile unsigned long err_pkts;
> > +};
> > +
> > +struct pkt_tx_queue {
> > +	int sockfd;
> > +
> > +	struct iovec *rd;
> > +	uint8_t *map;
> > +	unsigned int framecount;
> > +	unsigned int framenum;
> > +
> > +	volatile unsigned long tx_pkts;
> > +	volatile unsigned long err_pkts;
> > +};
> > +
> > +struct pmd_internals {
> > +	unsigned nb_queues;
> > +
> > +	int if_index;
> > +	struct ether_addr eth_addr;
> > +
> > +	struct tpacket_req req;
> > +
> > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > +};
> > +
> > +static const char *valid_arguments[] = {
> > +	ETH_PACKET_IFACE_ARG,
> > +	ETH_PACKET_NUM_Q_ARG,
> > +	ETH_PACKET_BLOCKSIZE_ARG,
> > +	ETH_PACKET_FRAMESIZE_ARG,
> > +	ETH_PACKET_FRAMECOUNT_ARG,
> > +	NULL
> > +};
> > +
> > +static const char *drivername = "AF_PACKET PMD";
> > +
> > +static struct rte_eth_link pmd_link = {
> > +	.link_speed = 10000,
> > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > +	.link_status = 0
> > +};
> > +
> > +static uint16_t
> > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > +{
> > +	unsigned i;
> > +	struct tpacket2_hdr *ppd;
> > +	struct rte_mbuf *mbuf;
> > +	uint8_t *pbuf;
> > +	struct pkt_rx_queue *pkt_q = queue;
> > +	uint16_t num_rx = 0;
> > +	unsigned int framecount, framenum;
> > +
> > +	if (unlikely(nb_pkts == 0))
> > +		return 0;
> > +
> > +	/*
> > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > +	 * one and copies the packet data into a newly allocated mbuf.
> > +	 */
> > +	framecount = pkt_q->framecount;
> > +	framenum = pkt_q->framenum;
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		/* point at the next incoming frame */
> > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > +			break;
> > +
> > +		/* allocate the next mbuf */
> > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > +		if (unlikely(mbuf == NULL))
> > +			break;
> > +
> > +		/* packet will fit in the mbuf, go ahead and receive it */
> > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > +
> > +		/* release incoming frame and advance ring buffer */
> > +		ppd->tp_status = TP_STATUS_KERNEL;
> > +		if (++framenum >= framecount)
> > +			framenum = 0;
> > +
> > +		/* account for the receive frame */
> > +		bufs[i] = mbuf;
> > +		num_rx++;
> > +	}
> > +	pkt_q->framenum = framenum;
> > +	pkt_q->rx_pkts += num_rx;
> > +	return num_rx;
> > +}
> > +
> > +/*
> > + * Callback to handle sending packets through a real NIC.
> > + */
> > +static uint16_t
> > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > +{
> > +	struct tpacket2_hdr *ppd;
> > +	struct rte_mbuf *mbuf;
> > +	uint8_t *pbuf;
> > +	unsigned int framecount, framenum;
> > +	struct pollfd pfd;
> > +	struct pkt_tx_queue *pkt_q = queue;
> > +	uint16_t num_tx = 0;
> > +	int i;
> > +
> > +	if (unlikely(nb_pkts == 0))
> > +		return 0;
> > +
> > +	memset(&pfd, 0, sizeof(pfd));
> > +	pfd.fd = pkt_q->sockfd;
> > +	pfd.events = POLLOUT;
> > +	pfd.revents = 0;
> > +
> > +	framecount = pkt_q->framecount;
> > +	framenum = pkt_q->framenum;
> > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		/* wait for the next Tx frame to become available */
> > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > +		    (poll(&pfd, 1, -1) < 0))
> > +				continue;
> > +
> > +		/* copy the tx frame data */
> > +		mbuf = bufs[num_tx];
> > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > +			sizeof(struct sockaddr_ll);
> > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > +
> > +		/* mark frame ready to send and advance ring buffer */
> > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > +		if (++framenum >= framecount)
> > +			framenum = 0;
> > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > +
> > +		num_tx++;
> > +		rte_pktmbuf_free(mbuf);
> > +	}
> > +
> > +	/* kick-off transmits */
> > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > +
> > +	pkt_q->framenum = framenum;
> > +	pkt_q->tx_pkts += num_tx;
> > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > +	return num_tx;
> > +}
> > +
> > +static int
> > +eth_dev_start(struct rte_eth_dev *dev)
> > +{
> > +	dev->data->dev_link.link_status = 1;
> > +	return 0;
> > +}
> > +
> > +/*
> > + * This function gets called when the current port gets stopped.
> > + */
> > +static void
> > +eth_dev_stop(struct rte_eth_dev *dev)
> > +{
> > +	unsigned i;
> > +	int sockfd;
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	for (i = 0; i < internals->nb_queues; i++) {
> > +		sockfd = internals->rx_queue[i].sockfd;
> > +		if (sockfd != -1)
> > +			close(sockfd);
> > +		sockfd = internals->tx_queue[i].sockfd;
> > +		if (sockfd != -1)
> > +			close(sockfd);
> > +	}
> > +
> > +	dev->data->dev_link.link_status = 0;
> > +}
> > +
> > +static int
> > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > +{
> > +	return 0;
> > +}
> > +
> > +static void
> > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > +{
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	dev_info->driver_name = drivername;
> > +	dev_info->if_index = internals->if_index;
> > +	dev_info->max_mac_addrs = 1;
> > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > +	dev_info->min_rx_bufsize = 0;
> > +	dev_info->pci_dev = NULL;
> > +}
> > +
> > +static void
> > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > +{
> > +	unsigned i, imax;
> > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > +	const struct pmd_internals *internal = dev->data->dev_private;
> > +
> > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > +
> > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > +	for (i = 0; i < imax; i++) {
> > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > +		rx_total += igb_stats->q_ipackets[i];
> > +	}
> > +
> > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > +	for (i = 0; i < imax; i++) {
> > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > +		tx_total += igb_stats->q_opackets[i];
> > +		tx_err_total += igb_stats->q_errors[i];
> > +	}
> > +
> > +	igb_stats->ipackets = rx_total;
> > +	igb_stats->opackets = tx_total;
> > +	igb_stats->oerrors = tx_err_total;
> > +}
> > +
> > +static void
> > +eth_stats_reset(struct rte_eth_dev *dev)
> > +{
> > +	unsigned i;
> > +	struct pmd_internals *internal = dev->data->dev_private;
> > +
> > +	for (i = 0; i < internal->nb_queues; i++)
> > +		internal->rx_queue[i].rx_pkts = 0;
> > +
> > +	for (i = 0; i < internal->nb_queues; i++) {
> > +		internal->tx_queue[i].tx_pkts = 0;
> > +		internal->tx_queue[i].err_pkts = 0;
> > +	}
> > +}
> > +
> > +static void
> > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > +{
> > +}
> > +
> > +static void
> > +eth_queue_release(void *q __rte_unused)
> > +{
> > +}
> > +
> > +static int
> > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > +                int wait_to_complete __rte_unused)
> > +{
> > +	return 0;
> > +}
> > +
> > +static int
> > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > +                   uint16_t rx_queue_id,
> > +                   uint16_t nb_rx_desc __rte_unused,
> > +                   unsigned int socket_id __rte_unused,
> > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > +                   struct rte_mempool *mb_pool)
> > +{
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > +	uint16_t buf_size;
> > +
> > +	pkt_q->mb_pool = mb_pool;
> > +
> > +	/* Now get the space available for data in the mbuf */
> > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > +	                       RTE_PKTMBUF_HEADROOM);
> > +
> > +	if (ETH_FRAME_LEN > buf_size) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > +                   uint16_t tx_queue_id,
> > +                   uint16_t nb_tx_desc __rte_unused,
> > +                   unsigned int socket_id __rte_unused,
> > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > +{
> > +
> > +	struct pmd_internals *internals = dev->data->dev_private;
> > +
> > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > +	return 0;
> > +}
> > +
> > +static struct eth_dev_ops ops = {
> > +	.dev_start = eth_dev_start,
> > +	.dev_stop = eth_dev_stop,
> > +	.dev_close = eth_dev_close,
> > +	.dev_configure = eth_dev_configure,
> > +	.dev_infos_get = eth_dev_info,
> > +	.rx_queue_setup = eth_rx_queue_setup,
> > +	.tx_queue_setup = eth_tx_queue_setup,
> > +	.rx_queue_release = eth_queue_release,
> > +	.tx_queue_release = eth_queue_release,
> > +	.link_update = eth_link_update,
> > +	.stats_get = eth_stats_get,
> > +	.stats_reset = eth_stats_reset,
> > +};
> > +
> > +/*
> > + * Opens an AF_PACKET socket
> > + */
> > +static int
> > +open_packet_iface(const char *key __rte_unused,
> > +                  const char *value __rte_unused,
> > +                  void *extra_args)
> > +{
> > +	int *sockfd = extra_args;
> > +
> > +	/* Open an AF_PACKET socket... */
> > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > +	if (*sockfd == -1) {
> > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > +		return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +rte_pmd_init_internals(const char *name,
> > +                       const int sockfd,
> > +                       const unsigned nb_queues,
> > +                       unsigned int blocksize,
> > +                       unsigned int blockcnt,
> > +                       unsigned int framesize,
> > +                       unsigned int framecnt,
> > +                       const unsigned numa_node,
> > +                       struct pmd_internals **internals,
> > +                       struct rte_eth_dev **eth_dev,
> > +                       struct rte_kvargs *kvlist)
> > +{
> > +	struct rte_eth_dev_data *data = NULL;
> > +	struct rte_pci_device *pci_dev = NULL;
> > +	struct rte_kvargs_pair *pair = NULL;
> > +	struct ifreq ifr;
> > +	size_t ifnamelen;
> > +	unsigned k_idx;
> > +	struct sockaddr_ll sockaddr;
> > +	struct tpacket_req *req;
> > +	struct pkt_rx_queue *rx_queue;
> > +	struct pkt_tx_queue *tx_queue;
> > +	int rc, tpver, discard, bypass;
> > +	unsigned int i, q, rdsize;
> > +	int qsockfd, fanout_arg;
> > +
> > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > +		pair = &kvlist->pairs[k_idx];
> > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > +			break;
> > +	}
> > +	if (pair == NULL) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +
> > +	RTE_LOG(INFO, PMD,
> > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > +		name, numa_node);
> > +
> > +	/*
> > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > +	 * and internal (private) data
> > +	 */
> > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > +	if (data == NULL)
> > +		goto error;
> > +
> > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > +	if (pci_dev == NULL)
> > +		goto error;
> > +
> > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > +	                                0, numa_node);
> > +	if (*internals == NULL)
> > +		goto error;
> > +
> > +	req = &((*internals)->req);
> > +
> > +	req->tp_block_size = blocksize;
> > +	req->tp_block_nr = blockcnt;
> > +	req->tp_frame_size = framesize;
> > +	req->tp_frame_nr = framecnt;
> > +
> > +	ifnamelen = strlen(pair->value);
> > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > +		ifr.ifr_name[ifnamelen] = '\0';
> > +	} else {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: I/F name too long (%s)\n",
> > +			name, pair->value);
> > +		goto error;
> > +	}
> > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +	(*internals)->if_index = ifr.ifr_ifindex;
> > +
> > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > +		        name);
> > +		goto error;
> > +	}
> > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > +
> > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > +	sockaddr.sll_family = AF_PACKET;
> > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > +
> > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > +
> > +	for (q = 0; q < nb_queues; q++) {
> > +		/* Open an AF_PACKET socket for this queue... */
> > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > +		if (qsockfd == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +			        "%s: could not open AF_PACKET socket\n",
> > +			        name);
> > +			return -1;
> > +		}
> > +
> > +		tpver = TPACKET_V2;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > +				&tpver, sizeof(tpver));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		discard = 1;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > +				&discard, sizeof(discard));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_LOSS on "
> > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		bypass = 1;
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > +				&bypass, sizeof(bypass));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_QDISC_BYPASS "
> > +			        "on AF_PACKET socket for %s\n", name,
> > +			        pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > +				"socket for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rx_queue = &((*internals)->rx_queue[q]);
> > +		rx_queue->framecount = req->tp_frame_nr;
> > +
> > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > +				    qsockfd, 0);
> > +		if (rx_queue->map == MAP_FAILED) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > +				name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		/* rdsize is same for both Tx and Rx */
> > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > +
> > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > +		}
> > +		rx_queue->sockfd = qsockfd;
> > +
> > +		tx_queue = &((*internals)->tx_queue[q]);
> > +		tx_queue->framecount = req->tp_frame_nr;
> > +
> > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > +
> > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > +		}
> > +		tx_queue->sockfd = qsockfd;
> > +
> > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not bind AF_PACKET socket to %s\n",
> > +			        name, pair->value);
> > +			goto error;
> > +		}
> > +
> > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > +				&fanout_arg, sizeof(fanout_arg));
> > +		if (rc == -1) {
> > +			RTE_LOG(ERR, PMD,
> > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > +				"for %s\n", name, pair->value);
> > +			goto error;
> > +		}
> > +	}
> > +
> > +	/* reserve an ethdev entry */
> > +	*eth_dev = rte_eth_dev_allocate(name);
> > +	if (*eth_dev == NULL)
> > +		goto error;
> > +
> > +	/*
> > +	 * now put it all together
> > +	 * - store queue data in internals,
> > +	 * - store numa_node info in pci_driver
> > +	 * - point eth_dev_data to internals and pci_driver
> > +	 * - and point eth_dev structure to new eth_dev_data structure
> > +	 */
> > +
> > +	(*internals)->nb_queues = nb_queues;
> > +
> > +	data->dev_private = *internals;
> > +	data->port_id = (*eth_dev)->data->port_id;
> > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > +	data->dev_link = pmd_link;
> > +	data->mac_addrs = &(*internals)->eth_addr;
> > +
> > +	pci_dev->numa_node = numa_node;
> > +
> > +	(*eth_dev)->data = data;
> > +	(*eth_dev)->dev_ops = &ops;
> > +	(*eth_dev)->pci_dev = pci_dev;
> > +
> > +	return 0;
> > +
> > +error:
> > +	if (data)
> > +		rte_free(data);
> > +	if (pci_dev)
> > +		rte_free(pci_dev);
> > +	for (q = 0; q < nb_queues; q++) {
> > +		if ((*internals)->rx_queue[q].rd)
> > +			rte_free((*internals)->rx_queue[q].rd);
> > +		if ((*internals)->tx_queue[q].rd)
> > +			rte_free((*internals)->tx_queue[q].rd);
> > +	}
> > +	if (*internals)
> > +		rte_free(*internals);
> > +	return -1;
> > +}
> > +
> > +static int
> > +rte_eth_from_packet(const char *name,
> > +                    int const *sockfd,
> > +                    const unsigned numa_node,
> > +                    struct rte_kvargs *kvlist)
> > +{
> > +	struct pmd_internals *internals = NULL;
> > +	struct rte_eth_dev *eth_dev = NULL;
> > +	struct rte_kvargs_pair *pair = NULL;
> > +	unsigned k_idx;
> > +	unsigned int blockcount;
> > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > +	unsigned int qpairs = 1;
> > +
> > +	/* do some parameter checking */
> > +	if (*sockfd < 0)
> > +		return -1;
> > +
> > +	/*
> > +	 * Walk arguments for configurable settings
> > +	 */
> > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > +		pair = &kvlist->pairs[k_idx];
> > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > +			qpairs = atoi(pair->value);
> > +			if (qpairs < 1 ||
> > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid qpairs value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > +			blocksize = atoi(pair->value);
> > +			if (!blocksize) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid blocksize value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > +			framesize = atoi(pair->value);
> > +			if (!framesize) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid framesize value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > +			framecount = atoi(pair->value);
> > +			if (!framecount) {
> > +				RTE_LOG(ERR, PMD,
> > +					"%s: invalid framecount value\n",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> > +	}
> > +
> > +	if (framesize > blocksize) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > +		        name);
> > +		return -1;
> > +	}
> > +
> > +	blockcount = framecount / (blocksize / framesize);
> > +	if (!blockcount) {
> > +		RTE_LOG(ERR, PMD,
> > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > +		return -1;
> > +	}
> > +
> > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > +
> > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > +	                           blocksize, blockcount,
> > +	                           framesize, framecount,
> > +	                           numa_node, &internals, &eth_dev,
> > +	                           kvlist) < 0)
> > +		return -1;
> > +
> > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > +
> > +	return 0;
> > +}
> > +
> > +int
> > +rte_pmd_packet_devinit(const char *name, const char *params)
> > +{
> > +	unsigned numa_node;
> > +	int ret;
> > +	struct rte_kvargs *kvlist;
> > +	int sockfd = -1;
> > +
> > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > +
> > +	numa_node = rte_socket_id();
> > +
> > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > +	if (kvlist == NULL)
> > +		return -1;
> > +
> > +	/*
> > +	 * If iface argument is passed we open the NICs and use them for
> > +	 * reading / writing
> > +	 */
> > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > +
> > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > +		                         &open_packet_iface, &sockfd);
> > +		if (ret < 0)
> > +			return -1;
> > +	}
> > +
> > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > +	close(sockfd); /* no longer needed */
> > +
> > +	if (ret < 0)
> > +		return -1;
> > +
> > +	return 0;
> > +}
> > +
> > +static struct rte_driver pmd_packet_drv = {
> > +	.name = "eth_packet",
> > +	.type = PMD_VDEV,
> > +	.init = rte_pmd_packet_devinit,
> > +};
> > +
> > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > new file mode 100644
> > index 000000000000..f685611da3e9
> > --- /dev/null
> > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > @@ -0,0 +1,55 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#ifndef _RTE_ETH_PACKET_H_
> > +#define _RTE_ETH_PACKET_H_
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > +
> > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > +
> > +/**
> > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > + * configured on command line.
> > + */
> > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif
> > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > index 34dff2a02a05..a6994c4dbe93 100644
> > --- a/mk/rte.app.mk
> > +++ b/mk/rte.app.mk
> > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> >  LDLIBS += -lrte_pmd_pcap -lpcap
> >  endif
> >
> > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > +LDLIBS += -lrte_pmd_packet
> > +endif
> > +
> >  endif # plugins
> >
> >  LDLIBS += $(EXECENV_LDLIBS)
> > --
> > 1.9.3
> >
> >
> 
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.
  
John W. Linville Sept. 12, 2014, 6:54 p.m. UTC | #11
On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> I am concerned about the performance cost of so many memcpy() calls.
> Specifically, on the Rx side the kernel NIC driver needs to copy
> packets into an skb, af_packet then copies them into the AF_PACKET
> buffer that is mapped to user space, and those packets are then copied
> into DPDK mbufs. Another 3 copies are needed on the Tx side. So to run
> a simple DPDK L2/L3 forwarding benchmark, each packet needs 6 copies,
> which has a significant negative performance impact. We
> had a bifurcated driver prototype that can do zero-copy and achieve
> native DPDK performance, but it depends on base driver and AF_PACKET
> code changes in the kernel. John R will be presenting it at the coming
> Linux Plumbers Conference. Once the kernel adopts it, the relevant PMD
> will be submitted to dpdk.org.

Admittedly, this is not as good a performer as most of the existing
PMDs.  It serves a different purpose, after all.  FWIW, you did
previously indicate that it performed better than the pcap-based PMD.
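
To make the last Rx copy in that count concrete (ring frame to
application buffer), here is a minimal user-space sketch modeled on the
eth_packet_rx() loop in the quoted patch.  It is a simulation under
stated assumptions, not the driver itself: the ring is a plain array
rather than an mmap'ed AF_PACKET ring, and sim_frame, sim_buf,
SIM_STATUS_KERNEL, SIM_STATUS_USER and sim_rx_burst are hypothetical
names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for TP_STATUS_KERNEL / TP_STATUS_USER. */
#define SIM_STATUS_KERNEL 0u
#define SIM_STATUS_USER   1u
#define SIM_FRAME_DATA    64u

/* Hypothetical ring slot: ownership word, payload length, payload. */
struct sim_frame {
	uint32_t status;
	uint32_t len;
	uint8_t  data[SIM_FRAME_DATA];
};

/* Hypothetical application buffer standing in for an mbuf. */
struct sim_buf {
	uint32_t len;
	uint8_t  data[SIM_FRAME_DATA];
};

/*
 * Copy up to nb_pkts ready frames out of the ring, starting at *framenum.
 * Stops at the first slot that is not marked user-owned.  Every consumed
 * slot is handed back to the "kernel" and the index wraps at framecount.
 */
static unsigned
sim_rx_burst(struct sim_frame *ring, unsigned framecount,
	     unsigned *framenum, struct sim_buf *bufs, unsigned nb_pkts)
{
	unsigned num_rx = 0;
	unsigned idx = *framenum;

	assert(framecount > 0 && idx < framecount);

	while (num_rx < nb_pkts) {
		struct sim_frame *f = &ring[idx];
		uint32_t len;

		if ((f->status & SIM_STATUS_USER) == 0)
			break;

		/* clamp so a bad length cannot overrun the buffer */
		len = f->len > SIM_FRAME_DATA ? SIM_FRAME_DATA : f->len;

		/* this memcpy is the per-packet copy being discussed */
		memcpy(bufs[num_rx].data, f->data, len);
		bufs[num_rx].len = len;

		/* release the slot and advance the ring index */
		f->status = SIM_STATUS_KERNEL;
		if (++idx >= framecount)
			idx = 0;
		num_rx++;
	}

	*framenum = idx;
	return num_rx;
}
```

A caller would mark some slots SIM_STATUS_USER, call sim_rx_burst()
with a burst size, and get back the number of frames copied.  The copy
is one memcpy() per frame, paid in exchange for not needing any
device-specific knowledge in user space.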

I look forward to seeing the changes you mention -- they sound very
exciting.  But, they will still require both networking core and
driver changes in the kernel.  And as I understand things today,
the userland code will still need at least some knowledge of specific
devices and how they lay out their packet descriptors, etc.  So while
those changes sound very promising, they will still have certain
drawbacks in common with the current situation.

It seems like the changes you mention will still need some sort of
AF_PACKET-based PMD.  Have you implemented that completely
separately from the code I already posted?  Or did you add that work
on top of mine?

John
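
For reference while reading the quoted patch: it derives the AF_PACKET
MMAP block count from the other three options as
blockcount = framecount / (blocksize / framesize), and rejects a frame
size larger than the block size.  A small sketch of that arithmetic
follows; sim_block_count is a hypothetical helper written for this
note, not a function in the patch.

```c
#include <assert.h>

/*
 * Return the number of ring blocks for the given MMAP parameters,
 * or 0 if the combination is invalid (zero values, a frame larger
 * than a block, or fewer frames than fit in one block).
 */
static unsigned int
sim_block_count(unsigned int blocksize, unsigned int framesize,
		unsigned int framecount)
{
	unsigned int frames_per_block;

	if (blocksize == 0 || framesize == 0 || framecount == 0)
		return 0;
	if (framesize > blocksize)
		return 0;

	frames_per_block = blocksize / framesize;
	assert(frames_per_block >= 1);

	return framecount / frames_per_block;
}
```

With the defaults from the commit message (blocksz 4096, framesz 2048,
framecnt 512), two frames fit in each block, so the result is 256
blocks.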

> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > Sent: Saturday, September 13, 2014 2:05 AM
> > To: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > 
> > Ping?  Are there objections to this patch from mid-July?
> > 
> > John
> > 
> > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > AF_PACKET is used for frame reception.  In the current implementation,
> > > Tx and Rx queues are always paired, and therefore are always equal
> > > in number -- changing this would be a Simple Matter Of Programming.
> > >
> > > Interfaces of this type are created with a command line option like
> > > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > > as arguments:
> > >
> > >  - Interface is chosen by "iface" (required)
> > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > >
> > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > ---
> > > This PMD is intended to provide a means for using DPDK on a broad
> > > range of hardware without hardware-specific PMDs and (hopefully)
> > > with better performance than what PCAP offers in Linux.  This might
> > > be useful as a development platform for DPDK applications when
> > > DPDK-supported hardware is expensive or unavailable.
> > >
> > > New in v2:
> > >
> > > -- fix up some style issues found by checkpatch
> > > -- use if_index as part of fanout group ID
> > > -- set default number of queue pairs to 1
> > >
> > >  config/common_bsdapp                   |   5 +
> > >  config/common_linuxapp                 |   5 +
> > >  lib/Makefile                           |   1 +
> > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > >  mk/rte.app.mk                          |   4 +
> > >  8 files changed, 957 insertions(+)
> > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > >
> > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > index 943dce8f1ede..c317f031278e 100644
> > > --- a/config/common_bsdapp
> > > +++ b/config/common_bsdapp
> > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > >
> > >  #
> > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > +#
> > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > +
> > > +#
> > >  # Do prefetch of packet data within PMD driver receive function
> > >  #
> > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > --- a/config/common_linuxapp
> > > +++ b/config/common_linuxapp
> > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > >
> > >  #
> > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > +#
> > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > +
> > > +#
> > >  # Compile Xen PMD
> > >  #
> > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > diff --git a/lib/Makefile b/lib/Makefile
> > > index 10c5bb3045bc..930fadf29898 100644
> > > --- a/lib/Makefile
> > > +++ b/lib/Makefile
> > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > index 756d6b0c9301..feed24a63272 100644
> > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > >  CFLAGS += $(WERROR_FLAGS) -O3
> > >
> > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > new file mode 100644
> > > index 000000000000..e1266fb992cd
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/Makefile
> > > @@ -0,0 +1,60 @@
> > > +#   BSD LICENSE
> > > +#
> > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > +#   Copyright(c) 2014 6WIND S.A.
> > > +#   All rights reserved.
> > > +#
> > > +#   Redistribution and use in source and binary forms, with or without
> > > +#   modification, are permitted provided that the following conditions
> > > +#   are met:
> > > +#
> > > +#     * Redistributions of source code must retain the above copyright
> > > +#       notice, this list of conditions and the following disclaimer.
> > > +#     * Redistributions in binary form must reproduce the above copyright
> > > +#       notice, this list of conditions and the following disclaimer in
> > > +#       the documentation and/or other materials provided with the
> > > +#       distribution.
> > > +#     * Neither the name of Intel Corporation nor the names of its
> > > +#       contributors may be used to endorse or promote products derived
> > > +#       from this software without specific prior written permission.
> > > +#
> > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > +
> > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > +
> > > +#
> > > +# library name
> > > +#
> > > +LIB = librte_pmd_packet.a
> > > +
> > > +CFLAGS += -O3
> > > +CFLAGS += $(WERROR_FLAGS)
> > > +
> > > +#
> > > +# all source are stored in SRCS-y
> > > +#
> > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > +
> > > +#
> > > +# Export include files
> > > +#
> > > +SYMLINK-y-include += rte_eth_packet.h
> > > +
> > > +# this lib depends upon:
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > +
> > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > new file mode 100644
> > > index 000000000000..9c82d16e730f
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > @@ -0,0 +1,826 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > + *
> > > + *   Originally based upon librte_pmd_pcap code:
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   Copyright(c) 2014 6WIND S.A.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > + */
> > > +
> > > +#include <rte_mbuf.h>
> > > +#include <rte_ethdev.h>
> > > +#include <rte_malloc.h>
> > > +#include <rte_kvargs.h>
> > > +#include <rte_dev.h>
> > > +
> > > +#include <linux/if_ether.h>
> > > +#include <linux/if_packet.h>
> > > +#include <arpa/inet.h>
> > > +#include <net/if.h>
> > > +#include <sys/types.h>
> > > +#include <sys/socket.h>
> > > +#include <sys/ioctl.h>
> > > +#include <sys/mman.h>
> > > +#include <unistd.h>
> > > +#include <poll.h>
> > > +
> > > +#include "rte_eth_packet.h"
> > > +
> > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > +
> > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > +
> > > +struct pkt_rx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	struct rte_mempool *mb_pool;
> > > +
> > > +	volatile unsigned long rx_pkts;
> > > +	volatile unsigned long err_pkts;
> > > +};
> > > +
> > > +struct pkt_tx_queue {
> > > +	int sockfd;
> > > +
> > > +	struct iovec *rd;
> > > +	uint8_t *map;
> > > +	unsigned int framecount;
> > > +	unsigned int framenum;
> > > +
> > > +	volatile unsigned long tx_pkts;
> > > +	volatile unsigned long err_pkts;
> > > +};
> > > +
> > > +struct pmd_internals {
> > > +	unsigned nb_queues;
> > > +
> > > +	int if_index;
> > > +	struct ether_addr eth_addr;
> > > +
> > > +	struct tpacket_req req;
> > > +
> > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > +};
> > > +
> > > +static const char *valid_arguments[] = {
> > > +	ETH_PACKET_IFACE_ARG,
> > > +	ETH_PACKET_NUM_Q_ARG,
> > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > +	NULL
> > > +};
> > > +
> > > +static const char *drivername = "AF_PACKET PMD";
> > > +
> > > +static struct rte_eth_link pmd_link = {
> > > +	.link_speed = 10000,
> > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > +	.link_status = 0
> > > +};
> > > +
> > > +static uint16_t
> > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > +{
> > > +	unsigned i;
> > > +	struct tpacket2_hdr *ppd;
> > > +	struct rte_mbuf *mbuf;
> > > +	uint8_t *pbuf;
> > > +	struct pkt_rx_queue *pkt_q = queue;
> > > +	uint16_t num_rx = 0;
> > > +	unsigned int framecount, framenum;
> > > +
> > > +	if (unlikely(nb_pkts == 0))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > +	 */
> > > +	framecount = pkt_q->framecount;
> > > +	framenum = pkt_q->framenum;
> > > +	for (i = 0; i < nb_pkts; i++) {
> > > +		/* point at the next incoming frame */
> > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > +			break;
> > > +
> > > +		/* allocate the next mbuf */
> > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > +		if (unlikely(mbuf == NULL))
> > > +			break;
> > > +
> > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > +
> > > +		/* release incoming frame and advance ring buffer */
> > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > +		if (++framenum >= framecount)
> > > +			framenum = 0;
> > > +
> > > +		/* account for the receive frame */
> > > +		bufs[i] = mbuf;
> > > +		num_rx++;
> > > +	}
> > > +	pkt_q->framenum = framenum;
> > > +	pkt_q->rx_pkts += num_rx;
> > > +	return num_rx;
> > > +}
> > > +
> > > +/*
> > > + * Callback to handle sending packets through a real NIC.
> > > + */
> > > +static uint16_t
> > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > +{
> > > +	struct tpacket2_hdr *ppd;
> > > +	struct rte_mbuf *mbuf;
> > > +	uint8_t *pbuf;
> > > +	unsigned int framecount, framenum;
> > > +	struct pollfd pfd;
> > > +	struct pkt_tx_queue *pkt_q = queue;
> > > +	uint16_t num_tx = 0;
> > > +	int i;
> > > +
> > > +	if (unlikely(nb_pkts == 0))
> > > +		return 0;
> > > +
> > > +	memset(&pfd, 0, sizeof(pfd));
> > > +	pfd.fd = pkt_q->sockfd;
> > > +	pfd.events = POLLOUT;
> > > +	pfd.revents = 0;
> > > +
> > > +	framecount = pkt_q->framecount;
> > > +	framenum = pkt_q->framenum;
> > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +	for (i = 0; i < nb_pkts; i++) {
> > > +		/* point at the next incoming frame */
> > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > +		    (poll(&pfd, 1, -1) < 0))
> > > +				continue;
> > > +
> > > +		/* copy the tx frame data */
> > > +		mbuf = bufs[num_tx];
> > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > +			sizeof(struct sockaddr_ll);
> > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > +
> > > +		/* release incoming frame and advance ring buffer */
> > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > +		if (++framenum >= framecount)
> > > +			framenum = 0;
> > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > +
> > > +		num_tx++;
> > > +		rte_pktmbuf_free(mbuf);
> > > +	}
> > > +
> > > +	/* kick-off transmits */
> > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > +
> > > +	pkt_q->framenum = framenum;
> > > +	pkt_q->tx_pkts += num_tx;
> > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > +	return num_tx;
> > > +}
> > > +
> > > +static int
> > > +eth_dev_start(struct rte_eth_dev *dev)
> > > +{
> > > +	dev->data->dev_link.link_status = 1;
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * This function gets called when the current port gets stopped.
> > > + */
> > > +static void
> > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > +{
> > > +	unsigned i;
> > > +	int sockfd;
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > +		sockfd = internals->rx_queue[i].sockfd;
> > > +		if (sockfd != -1)
> > > +			close(sockfd);
> > > +		sockfd = internals->tx_queue[i].sockfd;
> > > +		if (sockfd != -1)
> > > +			close(sockfd);
> > > +	}
> > > +
> > > +	dev->data->dev_link.link_status = 0;
> > > +}
> > > +
> > > +static int
> > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static void
> > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > +{
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	dev_info->driver_name = drivername;
> > > +	dev_info->if_index = internals->if_index;
> > > +	dev_info->max_mac_addrs = 1;
> > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > +	dev_info->min_rx_bufsize = 0;
> > > +	dev_info->pci_dev = NULL;
> > > +}
> > > +
> > > +static void
> > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > +{
> > > +	unsigned i, imax;
> > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > +
> > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > +
> > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > +	for (i = 0; i < imax; i++) {
> > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > +		rx_total += igb_stats->q_ipackets[i];
> > > +	}
> > > +
> > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > +	for (i = 0; i < imax; i++) {
> > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > +		tx_total += igb_stats->q_opackets[i];
> > > +		tx_err_total += igb_stats->q_errors[i];
> > > +	}
> > > +
> > > +	igb_stats->ipackets = rx_total;
> > > +	igb_stats->opackets = tx_total;
> > > +	igb_stats->oerrors = tx_err_total;
> > > +}
> > > +
> > > +static void
> > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > +{
> > > +	unsigned i;
> > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > +
> > > +	for (i = 0; i < internal->nb_queues; i++)
> > > +		internal->rx_queue[i].rx_pkts = 0;
> > > +
> > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > +		internal->tx_queue[i].tx_pkts = 0;
> > > +		internal->tx_queue[i].err_pkts = 0;
> > > +	}
> > > +}
> > > +
> > > +static void
> > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > +{
> > > +}
> > > +
> > > +static void
> > > +eth_queue_release(void *q __rte_unused)
> > > +{
> > > +}
> > > +
> > > +static int
> > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > +                int wait_to_complete __rte_unused)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > +                   uint16_t rx_queue_id,
> > > +                   uint16_t nb_rx_desc __rte_unused,
> > > +                   unsigned int socket_id __rte_unused,
> > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > +                   struct rte_mempool *mb_pool)
> > > +{
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > +	uint16_t buf_size;
> > > +
> > > +	pkt_q->mb_pool = mb_pool;
> > > +
> > > +	/* Now get the space available for data in the mbuf */
> > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > +	                       RTE_PKTMBUF_HEADROOM);
> > > +
> > > +	if (ETH_FRAME_LEN > buf_size) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > +                   uint16_t tx_queue_id,
> > > +                   uint16_t nb_tx_desc __rte_unused,
> > > +                   unsigned int socket_id __rte_unused,
> > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > +{
> > > +
> > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > +
> > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > +	return 0;
> > > +}
> > > +
> > > +static struct eth_dev_ops ops = {
> > > +	.dev_start = eth_dev_start,
> > > +	.dev_stop = eth_dev_stop,
> > > +	.dev_close = eth_dev_close,
> > > +	.dev_configure = eth_dev_configure,
> > > +	.dev_infos_get = eth_dev_info,
> > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > +	.rx_queue_release = eth_queue_release,
> > > +	.tx_queue_release = eth_queue_release,
> > > +	.link_update = eth_link_update,
> > > +	.stats_get = eth_stats_get,
> > > +	.stats_reset = eth_stats_reset,
> > > +};
> > > +
> > > +/*
> > > + * Opens an AF_PACKET socket
> > > + */
> > > +static int
> > > +open_packet_iface(const char *key __rte_unused,
> > > +                  const char *value __rte_unused,
> > > +                  void *extra_args)
> > > +{
> > > +	int *sockfd = extra_args;
> > > +
> > > +	/* Open an AF_PACKET socket... */
> > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > +	if (*sockfd == -1) {
> > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > +		return -1;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +rte_pmd_init_internals(const char *name,
> > > +                       const int sockfd,
> > > +                       const unsigned nb_queues,
> > > +                       unsigned int blocksize,
> > > +                       unsigned int blockcnt,
> > > +                       unsigned int framesize,
> > > +                       unsigned int framecnt,
> > > +                       const unsigned numa_node,
> > > +                       struct pmd_internals **internals,
> > > +                       struct rte_eth_dev **eth_dev,
> > > +                       struct rte_kvargs *kvlist)
> > > +{
> > > +	struct rte_eth_dev_data *data = NULL;
> > > +	struct rte_pci_device *pci_dev = NULL;
> > > +	struct rte_kvargs_pair *pair = NULL;
> > > +	struct ifreq ifr;
> > > +	size_t ifnamelen;
> > > +	unsigned k_idx;
> > > +	struct sockaddr_ll sockaddr;
> > > +	struct tpacket_req *req;
> > > +	struct pkt_rx_queue *rx_queue;
> > > +	struct pkt_tx_queue *tx_queue;
> > > +	int rc, tpver, discard, bypass;
> > > +	unsigned int i, q, rdsize;
> > > +	int qsockfd, fanout_arg;
> > > +
> > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > +		pair = &kvlist->pairs[k_idx];
> > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > +			break;
> > > +	}
> > > +	if (pair == NULL) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +
> > > +	RTE_LOG(INFO, PMD,
> > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > +		name, numa_node);
> > > +
> > > +	/*
> > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > +	 * and internal (private) data
> > > +	 */
> > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > +	if (data == NULL)
> > > +		goto error;
> > > +
> > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > +	if (pci_dev == NULL)
> > > +		goto error;
> > > +
> > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > +	                                0, numa_node);
> > > +	if (*internals == NULL)
> > > +		goto error;
> > > +
> > > +	req = &((*internals)->req);
> > > +
> > > +	req->tp_block_size = blocksize;
> > > +	req->tp_block_nr = blockcnt;
> > > +	req->tp_frame_size = framesize;
> > > +	req->tp_frame_nr = framecnt;
> > > +
> > > +	ifnamelen = strlen(pair->value);
> > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > +	} else {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: I/F name too long (%s)\n",
> > > +			name, pair->value);
> > > +		goto error;
> > > +	}
> > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > +
> > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > +		        name);
> > > +		goto error;
> > > +	}
> > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > +
> > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > +	sockaddr.sll_family = AF_PACKET;
> > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > +
> > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > +
> > > +	for (q = 0; q < nb_queues; q++) {
> > > +		/* Open an AF_PACKET socket for this queue... */
> > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > +		if (qsockfd == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +			        "%s: could not open AF_PACKET socket\n",
> > > +			        name);
> > > +			return -1;
> > > +		}
> > > +
> > > +		tpver = TPACKET_V2;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > +				&tpver, sizeof(tpver));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		discard = 1;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > +				&discard, sizeof(discard));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_LOSS on "
> > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		bypass = 1;
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > +				&bypass, sizeof(bypass));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > +			        "on AF_PACKET socket for %s\n", name,
> > > +			        pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > +				"socket for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > +		rx_queue->framecount = req->tp_frame_nr;
> > > +
> > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > +				    qsockfd, 0);
> > > +		if (rx_queue->map == MAP_FAILED) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > +				name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		/* rdsize is same for both Tx and Rx */
> > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > +
> > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > +		}
> > > +		rx_queue->sockfd = qsockfd;
> > > +
> > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > +		tx_queue->framecount = req->tp_frame_nr;
> > > +
> > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > +
> > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > +		}
> > > +		tx_queue->sockfd = qsockfd;
> > > +
> > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > +			        name, pair->value);
> > > +			goto error;
> > > +		}
> > > +
> > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > +				&fanout_arg, sizeof(fanout_arg));
> > > +		if (rc == -1) {
> > > +			RTE_LOG(ERR, PMD,
> > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > +				"for %s\n", name, pair->value);
> > > +			goto error;
> > > +		}
> > > +	}
> > > +
> > > +	/* reserve an ethdev entry */
> > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > +	if (*eth_dev == NULL)
> > > +		goto error;
> > > +
> > > +	/*
> > > +	 * now put it all together
> > > +	 * - store queue data in internals,
> > > +	 * - store numa_node info in pci_driver
> > > +	 * - point eth_dev_data to internals and pci_driver
> > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > +	 */
> > > +
> > > +	(*internals)->nb_queues = nb_queues;
> > > +
> > > +	data->dev_private = *internals;
> > > +	data->port_id = (*eth_dev)->data->port_id;
> > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > +	data->dev_link = pmd_link;
> > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > +
> > > +	pci_dev->numa_node = numa_node;
> > > +
> > > +	(*eth_dev)->data = data;
> > > +	(*eth_dev)->dev_ops = &ops;
> > > +	(*eth_dev)->pci_dev = pci_dev;
> > > +
> > > +	return 0;
> > > +
> > > +error:
> > > +	if (data)
> > > +		rte_free(data);
> > > +	if (pci_dev)
> > > +		rte_free(pci_dev);
> > > +	for (q = 0; q < nb_queues; q++) {
> > > +		if ((*internals)->rx_queue[q].rd)
> > > +			rte_free((*internals)->rx_queue[q].rd);
> > > +		if ((*internals)->tx_queue[q].rd)
> > > +			rte_free((*internals)->tx_queue[q].rd);
> > > +	}
> > > +	if (*internals)
> > > +		rte_free(*internals);
> > > +	return -1;
> > > +}
> > > +
> > > +static int
> > > +rte_eth_from_packet(const char *name,
> > > +                    int const *sockfd,
> > > +                    const unsigned numa_node,
> > > +                    struct rte_kvargs *kvlist)
> > > +{
> > > +	struct pmd_internals *internals = NULL;
> > > +	struct rte_eth_dev *eth_dev = NULL;
> > > +	struct rte_kvargs_pair *pair = NULL;
> > > +	unsigned k_idx;
> > > +	unsigned int blockcount;
> > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > +	unsigned int qpairs = 1;
> > > +
> > > +	/* do some parameter checking */
> > > +	if (*sockfd < 0)
> > > +		return -1;
> > > +
> > > +	/*
> > > +	 * Walk arguments for configurable settings
> > > +	 */
> > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > +		pair = &kvlist->pairs[k_idx];
> > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > +			qpairs = atoi(pair->value);
> > > +			if (qpairs < 1 ||
> > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid qpairs value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > +			blocksize = atoi(pair->value);
> > > +			if (!blocksize) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid blocksize value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > +			framesize = atoi(pair->value);
> > > +			if (!framesize) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid framesize value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > +			framecount = atoi(pair->value);
> > > +			if (!framecount) {
> > > +				RTE_LOG(ERR, PMD,
> > > +					"%s: invalid framecount value\n",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}
> > > +	}
> > > +
> > > +	if (framesize > blocksize) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > +		        name);
> > > +		return -1;
> > > +	}
> > > +
> > > +	blockcount = framecount / (blocksize / framesize);
> > > +	if (!blockcount) {
> > > +		RTE_LOG(ERR, PMD,
> > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > +		return -1;
> > > +	}
> > > +
> > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > +
> > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > +	                           blocksize, blockcount,
> > > +	                           framesize, framecount,
> > > +	                           numa_node, &internals, &eth_dev,
> > > +	                           kvlist) < 0)
> > > +		return -1;
> > > +
> > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int
> > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > +{
> > > +	unsigned numa_node;
> > > +	int ret;
> > > +	struct rte_kvargs *kvlist;
> > > +	int sockfd = -1;
> > > +
> > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > +
> > > +	numa_node = rte_socket_id();
> > > +
> > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > +	if (kvlist == NULL)
> > > +		return -1;
> > > +
> > > +	/*
> > > +	 * If iface argument is passed we open the NICs and use them for
> > > +	 * reading / writing
> > > +	 */
> > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > +
> > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > +		                         &open_packet_iface, &sockfd);
> > > +		if (ret < 0)
> > > +			return -1;
> > > +	}
> > > +
> > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > +	close(sockfd); /* no longer needed */
> > > +
> > > +	if (ret < 0)
> > > +		return -1;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static struct rte_driver pmd_packet_drv = {
> > > +	.name = "eth_packet",
> > > +	.type = PMD_VDEV,
> > > +	.init = rte_pmd_packet_devinit,
> > > +};
> > > +
> > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > new file mode 100644
> > > index 000000000000..f685611da3e9
> > > --- /dev/null
> > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > @@ -0,0 +1,55 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > + */
> > > +
> > > +#ifndef _RTE_ETH_PACKET_H_
> > > +#define _RTE_ETH_PACKET_H_
> > > +
> > > +#ifdef __cplusplus
> > > +extern "C" {
> > > +#endif
> > > +
> > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > +
> > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > +
> > > +/**
> > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > + * configured on command line.
> > > + */
> > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > +
> > > +#ifdef __cplusplus
> > > +}
> > > +#endif
> > > +
> > > +#endif
> > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > index 34dff2a02a05..a6994c4dbe93 100644
> > > --- a/mk/rte.app.mk
> > > +++ b/mk/rte.app.mk
> > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > >  endif
> > >
> > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > +LDLIBS += -lrte_pmd_packet
> > > +endif
> > > +
> > >  endif # plugins
> > >
> > >  LDLIBS += $(EXECENV_LDLIBS)
> > > --
> > > 1.9.3
> > >
> > >
> > 
> > --
> > John W. Linville		Someday the world will need a hero, and you
> > linville@tuxdriver.com			might be all we have.  Be ready.
>
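
For reference, the ring sizing done by rte_eth_from_packet() and rte_pmd_init_internals() in the patch quoted above can be checked in isolation. This is only a sketch: the names ring_geometry and compute_ring_geometry are made up for illustration and do not appear in the patch; only the arithmetic and the rejection rules follow the quoted code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of the patch. */
struct ring_geometry {
	unsigned int blockcount; /* value used for tp_block_nr */
	size_t ring_bytes;       /* size of one ring (Rx or Tx) */
	size_t map_bytes;        /* length of the one mmap() covering both rings */
	size_t tx_offset;        /* where the Tx ring starts inside the mapping */
};

/*
 * Mirrors the parameter checks in the quoted rte_eth_from_packet().
 * Returns 0 on success, -1 if the parameters would be rejected.
 */
static int
compute_ring_geometry(unsigned int blocksize, unsigned int framesize,
		      unsigned int framecount, struct ring_geometry *g)
{
	assert(g != NULL);

	if (blocksize == 0 || framesize == 0 || framecount == 0)
		return -1;
	if (framesize > blocksize)
		return -1;

	/* frames per block, then blocks needed to hold framecount frames */
	g->blockcount = framecount / (blocksize / framesize);
	if (g->blockcount == 0)
		return -1;

	g->ring_bytes = (size_t)blocksize * g->blockcount;
	g->map_bytes = 2 * g->ring_bytes; /* Rx ring followed by Tx ring */
	g->tx_offset = g->ring_bytes;
	return 0;
}
```

With the defaults from the commit message (blocksz 4096, framesz 2048, framecnt 512), this gives 2 frames per block, 256 blocks, 1 MiB per ring, and a 2 MiB mapping for each queue pair.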
  
Zhou, Danny Sept. 12, 2014, 8:35 p.m. UTC | #12
> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Saturday, September 13, 2014 2:54 AM
> To: Zhou, Danny
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> 
> On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > I am concerned about its performance because of the many
> > memcpy() calls. Specifically, on the Rx side, the kernel NIC driver
> > needs to copy packets to an skb, then af_packet copies them to the
> > AF_PACKET buffer that is mapped to user space, and then they are
> > copied to DPDK mbufs. In addition, 3 copies are needed on the Tx
> > side. So to run a simple DPDK L2/L3 forwarding benchmark, each packet
> > needs 6 copies, which has a significant negative performance impact.
> > We had a bifurcated driver prototype that can do zero-copy and achieve
> > native DPDK performance, but it depends on base driver and AF_PACKET
> > code changes in the kernel. John R will be presenting it at the coming
> > Linux Plumbers Conference. Once the kernel adopts it, the relevant PMD
> > will be submitted to dpdk.org.
> 
> Admittedly, this is not as good a performer as most of the existing
> PMDs.  It serves a different purpose, after all.  FWIW, you did
> previously indicate that it performed better than the pcap-based PMD.

Yes, slightly higher, but it makes no big difference.
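
To make the last Rx copy in that chain concrete, here is a minimal sketch of the per-frame step that eth_packet_rx() in the patch performs: check tp_status, copy tp_snaplen bytes out of the mmap'ed frame, and hand the slot back to the kernel. The name copy_frame_out and the plain byte buffer standing in for an mbuf are assumptions for illustration; the TPACKET_V2 header fields and status flags come from <linux/if_packet.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/if_packet.h>

/*
 * Copy one received frame out of a TPACKET_V2 ring slot.
 * Returns the number of bytes copied, 0 if the slot is still owned by
 * the kernel, or -1 if the frame does not fit in dst (the slot is then
 * left untouched).
 * Hypothetical helper; dst stands in for the mbuf data area.
 */
static int
copy_frame_out(void *slot, uint8_t *dst, size_t dst_len)
{
	struct tpacket2_hdr *ppd = slot;
	const uint8_t *pbuf;

	assert(slot != NULL && dst != NULL);

	/* nothing to read until the kernel marks the slot as user-owned */
	if ((ppd->tp_status & TP_STATUS_USER) == 0)
		return 0;
	if (ppd->tp_snaplen > dst_len)
		return -1;

	/* packet data starts tp_mac bytes after the start of the slot */
	pbuf = (const uint8_t *)ppd + ppd->tp_mac;
	memcpy(dst, pbuf, ppd->tp_snaplen);

	/* give the slot back to the kernel for the next packet */
	ppd->tp_status = TP_STATUS_KERNEL;
	return (int)ppd->tp_snaplen;
}
```

The memcpy() here is the third Rx copy counted above; the ring itself only removes the per-packet system call, not the copy into the mbuf.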

> I look forward to seeing the changes you mention -- they sound very
> exciting.  But, they will still require both networking core and
> driver changes in the kernel.  And as I understand things today,
> the userland code will still need at least some knowledge of specific
> devices and how they layout their packet descriptors, etc.  So while
> those changes sound very promising, they will still have certain
> drawbacks in common with the current situation.

Yes, we would like DPDK performance optimization techniques such as huge
pages, efficient rx/tx routines that manipulate device-specific packet
descriptors, and the polling model to remain usable. We have to trade off
performance against commonality. But we believe it will be much easier to
develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers
to DPDK.

> It seems like the changes you mention will still need some sort of
> AF_PACKET-based PMD driver.  Have you implemented that completely
> separate from the code I already posted?  Or did you add that work
> on top of mine?
> 

The userland code certainly uses some of your raw socket code, but it is
heavily modified. A layer will be added to the eth_dev library to do device
probing and to support new socket options.
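
One piece of the raw-socket setup in the posted patch is the PACKET_FANOUT argument: the low 16 bits carry a group ID derived from the PID and the interface index (the v2 change), and the high 16 bits carry the fanout mode and flags. The sketch below rebuilds that value in isolation. The name make_fanout_arg is hypothetical, and the unsigned type is a choice made here so that the shift into bit 31 is well-defined; the patch itself keeps the value in an int.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * Build the 32-bit PACKET_FANOUT option value used for every queue
 * socket of one interface. Hypothetical helper, not part of the patch.
 */
static uint32_t
make_fanout_arg(uint32_t pid, uint32_t if_index)
{
	uint32_t group_id;
	uint32_t type_flags;

	assert(if_index != 0); /* Linux interface indexes start at 1 */

	/* low 16 bits: fanout group ID shared by all queue sockets */
	group_id = (pid ^ if_index) & 0xffff;

	/* high 16 bits: hash-based fanout plus defrag and rollover flags */
	type_flags = PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
		     PACKET_FANOUT_FLAG_ROLLOVER;

	return group_id | (type_flags << 16);
}
```

Mixing the interface index into the group ID means two eth_packet devices created by the same process on different interfaces normally end up in different fanout groups, as long as the XOR results differ in their low 16 bits.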

> John
> 
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > > Sent: Saturday, September 13, 2014 2:05 AM
> > > To: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > >
> > > Ping?  Are there objections to this patch from mid-July?
> > >
> > > John
> > >
> > > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > in number -- changing this would be a Simple Matter Of Programming.
> > > >
> > > > Interfaces of this type are created with a command line option like
> > > > "--vdev=eth_packet0,iface=...".  There are a number of options availabe
> > > > as arguments:
> > > >
> > > >  - Interface is chosen by "iface" (required)
> > > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > >
> > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > ---
> > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > with better performance than what PCAP offers in Linux.  This might
> > > > be useful as a development platform for DPDK applications when
> > > > DPDK-supported hardware is expensive or unavailable.
> > > >
> > > > New in v2:
> > > >
> > > > -- fixup some style issues found by check patch
> > > > -- use if_index as part of fanout group ID
> > > > -- set default number of queue pairs to 1
> > > >
> > > >  config/common_bsdapp                   |   5 +
> > > >  config/common_linuxapp                 |   5 +
> > > >  lib/Makefile                           |   1 +
> > > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > >  mk/rte.app.mk                          |   4 +
> > > >  8 files changed, 957 insertions(+)
> > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > > >
> > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > index 943dce8f1ede..c317f031278e 100644
> > > > --- a/config/common_bsdapp
> > > > +++ b/config/common_bsdapp
> > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > >
> > > >  #
> > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > +#
> > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > +
> > > > +#
> > > >  # Do prefetch of packet data within PMD driver receive function
> > > >  #
> > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > --- a/config/common_linuxapp
> > > > +++ b/config/common_linuxapp
> > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > >
> > > >  #
> > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > +#
> > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > +
> > > > +#
> > > >  # Compile Xen PMD
> > > >  #
> > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > index 10c5bb3045bc..930fadf29898 100644
> > > > --- a/lib/Makefile
> > > > +++ b/lib/Makefile
> > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > > index 756d6b0c9301..feed24a63272 100644
> > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > >
> > > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > > new file mode 100644
> > > > index 000000000000..e1266fb992cd
> > > > --- /dev/null
> > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > @@ -0,0 +1,60 @@
> > > > +#   BSD LICENSE
> > > > +#
> > > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > +#   Copyright(c) 2014 6WIND S.A.
> > > > +#   All rights reserved.
> > > > +#
> > > > +#   Redistribution and use in source and binary forms, with or without
> > > > +#   modification, are permitted provided that the following conditions
> > > > +#   are met:
> > > > +#
> > > > +#     * Redistributions of source code must retain the above copyright
> > > > +#       notice, this list of conditions and the following disclaimer.
> > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > +#       notice, this list of conditions and the following disclaimer in
> > > > +#       the documentation and/or other materials provided with the
> > > > +#       distribution.
> > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > +#       contributors may be used to endorse or promote products derived
> > > > +#       from this software without specific prior written permission.
> > > > +#
> > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > +
> > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > +
> > > > +#
> > > > +# library name
> > > > +#
> > > > +LIB = librte_pmd_packet.a
> > > > +
> > > > +CFLAGS += -O3
> > > > +CFLAGS += $(WERROR_FLAGS)
> > > > +
> > > > +#
> > > > +# all source are stored in SRCS-y
> > > > +#
> > > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > > +
> > > > +#
> > > > +# Export include files
> > > > +#
> > > > +SYMLINK-y-include += rte_eth_packet.h
> > > > +
> > > > +# this lib depends upon:
> > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > > +
> > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > new file mode 100644
> > > > index 000000000000..9c82d16e730f
> > > > --- /dev/null
> > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > @@ -0,0 +1,826 @@
> > > > +/*-
> > > > + *   BSD LICENSE
> > > > + *
> > > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > > + *
> > > > + *   Originally based upon librte_pmd_pcap code:
> > > > + *
> > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > + *   Copyright(c) 2014 6WIND S.A.
> > > > + *   All rights reserved.
> > > > + *
> > > > + *   Redistribution and use in source and binary forms, with or without
> > > > + *   modification, are permitted provided that the following conditions
> > > > + *   are met:
> > > > + *
> > > > + *     * Redistributions of source code must retain the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer.
> > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer in
> > > > + *       the documentation and/or other materials provided with the
> > > > + *       distribution.
> > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > + *       contributors may be used to endorse or promote products derived
> > > > + *       from this software without specific prior written permission.
> > > > + *
> > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > + */
> > > > +
> > > > +#include <rte_mbuf.h>
> > > > +#include <rte_ethdev.h>
> > > > +#include <rte_malloc.h>
> > > > +#include <rte_kvargs.h>
> > > > +#include <rte_dev.h>
> > > > +
> > > > +#include <linux/if_ether.h>
> > > > +#include <linux/if_packet.h>
> > > > +#include <arpa/inet.h>
> > > > +#include <net/if.h>
> > > > +#include <sys/types.h>
> > > > +#include <sys/socket.h>
> > > > +#include <sys/ioctl.h>
> > > > +#include <sys/mman.h>
> > > > +#include <unistd.h>
> > > > +#include <poll.h>
> > > > +
> > > > +#include "rte_eth_packet.h"
> > > > +
> > > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > > +
> > > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > > +
> > > > +struct pkt_rx_queue {
> > > > +	int sockfd;
> > > > +
> > > > +	struct iovec *rd;
> > > > +	uint8_t *map;
> > > > +	unsigned int framecount;
> > > > +	unsigned int framenum;
> > > > +
> > > > +	struct rte_mempool *mb_pool;
> > > > +
> > > > +	volatile unsigned long rx_pkts;
> > > > +	volatile unsigned long err_pkts;
> > > > +};
> > > > +
> > > > +struct pkt_tx_queue {
> > > > +	int sockfd;
> > > > +
> > > > +	struct iovec *rd;
> > > > +	uint8_t *map;
> > > > +	unsigned int framecount;
> > > > +	unsigned int framenum;
> > > > +
> > > > +	volatile unsigned long tx_pkts;
> > > > +	volatile unsigned long err_pkts;
> > > > +};
> > > > +
> > > > +struct pmd_internals {
> > > > +	unsigned nb_queues;
> > > > +
> > > > +	int if_index;
> > > > +	struct ether_addr eth_addr;
> > > > +
> > > > +	struct tpacket_req req;
> > > > +
> > > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > +};
> > > > +
> > > > +static const char *valid_arguments[] = {
> > > > +	ETH_PACKET_IFACE_ARG,
> > > > +	ETH_PACKET_NUM_Q_ARG,
> > > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > > +	NULL
> > > > +};
> > > > +
> > > > +static const char *drivername = "AF_PACKET PMD";
> > > > +
> > > > +static struct rte_eth_link pmd_link = {
> > > > +	.link_speed = 10000,
> > > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > > +	.link_status = 0
> > > > +};
> > > > +
> > > > +static uint16_t
> > > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > +{
> > > > +	unsigned i;
> > > > +	struct tpacket2_hdr *ppd;
> > > > +	struct rte_mbuf *mbuf;
> > > > +	uint8_t *pbuf;
> > > > +	struct pkt_rx_queue *pkt_q = queue;
> > > > +	uint16_t num_rx = 0;
> > > > +	unsigned int framecount, framenum;
> > > > +
> > > > +	if (unlikely(nb_pkts == 0))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > > +	 */
> > > > +	framecount = pkt_q->framecount;
> > > > +	framenum = pkt_q->framenum;
> > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > +		/* point at the next incoming frame */
> > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > > +			break;
> > > > +
> > > > +		/* allocate the next mbuf */
> > > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > > +		if (unlikely(mbuf == NULL))
> > > > +			break;
> > > > +
> > > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > > +
> > > > +		/* release incoming frame and advance ring buffer */
> > > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > > +		if (++framenum >= framecount)
> > > > +			framenum = 0;
> > > > +
> > > > +		/* account for the receive frame */
> > > > +		bufs[i] = mbuf;
> > > > +		num_rx++;
> > > > +	}
> > > > +	pkt_q->framenum = framenum;
> > > > +	pkt_q->rx_pkts += num_rx;
> > > > +	return num_rx;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Callback to handle sending packets through a real NIC.
> > > > + */
> > > > +static uint16_t
> > > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > +{
> > > > +	struct tpacket2_hdr *ppd;
> > > > +	struct rte_mbuf *mbuf;
> > > > +	uint8_t *pbuf;
> > > > +	unsigned int framecount, framenum;
> > > > +	struct pollfd pfd;
> > > > +	struct pkt_tx_queue *pkt_q = queue;
> > > > +	uint16_t num_tx = 0;
> > > > +	int i;
> > > > +
> > > > +	if (unlikely(nb_pkts == 0))
> > > > +		return 0;
> > > > +
> > > > +	memset(&pfd, 0, sizeof(pfd));
> > > > +	pfd.fd = pkt_q->sockfd;
> > > > +	pfd.events = POLLOUT;
> > > > +	pfd.revents = 0;
> > > > +
> > > > +	framecount = pkt_q->framecount;
> > > > +	framenum = pkt_q->framenum;
> > > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > +		/* point at the next incoming frame */
> > > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > > +		    (poll(&pfd, 1, -1) < 0))
> > > > +				continue;
> > > > +
> > > > +		/* copy the tx frame data */
> > > > +		mbuf = bufs[num_tx];
> > > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > > +			sizeof(struct sockaddr_ll);
> > > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > > +
> > > > +		/* release incoming frame and advance ring buffer */
> > > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > > +		if (++framenum >= framecount)
> > > > +			framenum = 0;
> > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > +
> > > > +		num_tx++;
> > > > +		rte_pktmbuf_free(mbuf);
> > > > +	}
> > > > +
> > > > +	/* kick-off transmits */
> > > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > > +
> > > > +	pkt_q->framenum = framenum;
> > > > +	pkt_q->tx_pkts += num_tx;
> > > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > > +	return num_tx;
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_dev_start(struct rte_eth_dev *dev)
> > > > +{
> > > > +	dev->data->dev_link.link_status = 1;
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/*
> > > > + * This function gets called when the current port gets stopped.
> > > > + */
> > > > +static void
> > > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > > +{
> > > > +	unsigned i;
> > > > +	int sockfd;
> > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > +
> > > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > > +		sockfd = internals->rx_queue[i].sockfd;
> > > > +		if (sockfd != -1)
> > > > +			close(sockfd);
> > > > +		sockfd = internals->tx_queue[i].sockfd;
> > > > +		if (sockfd != -1)
> > > > +			close(sockfd);
> > > > +	}
> > > > +
> > > > +	dev->data->dev_link.link_status = 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > > +{
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > > +{
> > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > +
> > > > +	dev_info->driver_name = drivername;
> > > > +	dev_info->if_index = internals->if_index;
> > > > +	dev_info->max_mac_addrs = 1;
> > > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > > +	dev_info->min_rx_bufsize = 0;
> > > > +	dev_info->pci_dev = NULL;
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > > +{
> > > > +	unsigned i, imax;
> > > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > > +
> > > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > > +
> > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > +	for (i = 0; i < imax; i++) {
> > > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > > +		rx_total += igb_stats->q_ipackets[i];
> > > > +	}
> > > > +
> > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > +	for (i = 0; i < imax; i++) {
> > > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > > +		tx_total += igb_stats->q_opackets[i];
> > > > +		tx_err_total += igb_stats->q_errors[i];
> > > > +	}
> > > > +
> > > > +	igb_stats->ipackets = rx_total;
> > > > +	igb_stats->opackets = tx_total;
> > > > +	igb_stats->oerrors = tx_err_total;
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > > +{
> > > > +	unsigned i;
> > > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > > +
> > > > +	for (i = 0; i < internal->nb_queues; i++)
> > > > +		internal->rx_queue[i].rx_pkts = 0;
> > > > +
> > > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > > +		internal->tx_queue[i].tx_pkts = 0;
> > > > +		internal->tx_queue[i].err_pkts = 0;
> > > > +	}
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > > +{
> > > > +}
> > > > +
> > > > +static void
> > > > +eth_queue_release(void *q __rte_unused)
> > > > +{
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > > +                int wait_to_complete __rte_unused)
> > > > +{
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > > +                   uint16_t rx_queue_id,
> > > > +                   uint16_t nb_rx_desc __rte_unused,
> > > > +                   unsigned int socket_id __rte_unused,
> > > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > > +                   struct rte_mempool *mb_pool)
> > > > +{
> > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > > +	uint16_t buf_size;
> > > > +
> > > > +	pkt_q->mb_pool = mb_pool;
> > > > +
> > > > +	/* Now get the space available for data in the mbuf */
> > > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > > +	                       RTE_PKTMBUF_HEADROOM);
> > > > +
> > > > +	if (ETH_FRAME_LEN > buf_size) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > > +		return -ENOMEM;
> > > > +	}
> > > > +
> > > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > > +                   uint16_t tx_queue_id,
> > > > +                   uint16_t nb_tx_desc __rte_unused,
> > > > +                   unsigned int socket_id __rte_unused,
> > > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > > +{
> > > > +
> > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > +
> > > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static struct eth_dev_ops ops = {
> > > > +	.dev_start = eth_dev_start,
> > > > +	.dev_stop = eth_dev_stop,
> > > > +	.dev_close = eth_dev_close,
> > > > +	.dev_configure = eth_dev_configure,
> > > > +	.dev_infos_get = eth_dev_info,
> > > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > > +	.rx_queue_release = eth_queue_release,
> > > > +	.tx_queue_release = eth_queue_release,
> > > > +	.link_update = eth_link_update,
> > > > +	.stats_get = eth_stats_get,
> > > > +	.stats_reset = eth_stats_reset,
> > > > +};
> > > > +
> > > > +/*
> > > > + * Opens an AF_PACKET socket
> > > > + */
> > > > +static int
> > > > +open_packet_iface(const char *key __rte_unused,
> > > > +                  const char *value __rte_unused,
> > > > +                  void *extra_args)
> > > > +{
> > > > +	int *sockfd = extra_args;
> > > > +
> > > > +	/* Open an AF_PACKET socket... */
> > > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > +	if (*sockfd == -1) {
> > > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > > +		return -1;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +rte_pmd_init_internals(const char *name,
> > > > +                       const int sockfd,
> > > > +                       const unsigned nb_queues,
> > > > +                       unsigned int blocksize,
> > > > +                       unsigned int blockcnt,
> > > > +                       unsigned int framesize,
> > > > +                       unsigned int framecnt,
> > > > +                       const unsigned numa_node,
> > > > +                       struct pmd_internals **internals,
> > > > +                       struct rte_eth_dev **eth_dev,
> > > > +                       struct rte_kvargs *kvlist)
> > > > +{
> > > > +	struct rte_eth_dev_data *data = NULL;
> > > > +	struct rte_pci_device *pci_dev = NULL;
> > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > +	struct ifreq ifr;
> > > > +	size_t ifnamelen;
> > > > +	unsigned k_idx;
> > > > +	struct sockaddr_ll sockaddr;
> > > > +	struct tpacket_req *req;
> > > > +	struct pkt_rx_queue *rx_queue;
> > > > +	struct pkt_tx_queue *tx_queue;
> > > > +	int rc, tpver, discard, bypass;
> > > > +	unsigned int i, q, rdsize;
> > > > +	int qsockfd, fanout_arg;
> > > > +
> > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > +		pair = &kvlist->pairs[k_idx];
> > > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > > +			break;
> > > > +	}
> > > > +	if (pair == NULL) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > > +		        name);
> > > > +		goto error;
> > > > +	}
> > > > +
> > > > +	RTE_LOG(INFO, PMD,
> > > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > > +		name, numa_node);
> > > > +
> > > > +	/*
> > > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > > +	 * and internal (private) data
> > > > +	 */
> > > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > > +	if (data == NULL)
> > > > +		goto error;
> > > > +
> > > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > > +	if (pci_dev == NULL)
> > > > +		goto error;
> > > > +
> > > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > > +	                                0, numa_node);
> > > > +	if (*internals == NULL)
> > > > +		goto error;
> > > > +
> > > > +	req = &((*internals)->req);
> > > > +
> > > > +	req->tp_block_size = blocksize;
> > > > +	req->tp_block_nr = blockcnt;
> > > > +	req->tp_frame_size = framesize;
> > > > +	req->tp_frame_nr = framecnt;
> > > > +
> > > > +	ifnamelen = strlen(pair->value);
> > > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > > +	} else {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: I/F name too long (%s)\n",
> > > > +			name, pair->value);
> > > > +		goto error;
> > > > +	}
> > > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > > +		        name);
> > > > +		goto error;
> > > > +	}
> > > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > > +
> > > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > > +		        name);
> > > > +		goto error;
> > > > +	}
> > > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > > +
> > > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > > +	sockaddr.sll_family = AF_PACKET;
> > > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > > +
> > > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > > +
> > > > +	for (q = 0; q < nb_queues; q++) {
> > > > +		/* Open an AF_PACKET socket for this queue... */
> > > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > +		if (qsockfd == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +			        "%s: could not open AF_PACKET socket\n",
> > > > +			        name);
> > > > +			return -1;
> > > > +		}
> > > > +
> > > > +		tpver = TPACKET_V2;
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > > +				&tpver, sizeof(tpver));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > > +				"socket for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		discard = 1;
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > > +				&discard, sizeof(discard));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_LOSS on "
> > > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		bypass = 1;
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > > +				&bypass, sizeof(bypass));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > > +			        "on AF_PACKET socket for %s\n", name,
> > > > +			        pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > > +				"socket for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > > +				"socket for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > > +		rx_queue->framecount = req->tp_frame_nr;
> > > > +
> > > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > > +				    qsockfd, 0);
> > > > +		if (rx_queue->map == MAP_FAILED) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > > +				name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		/* rdsize is same for both Tx and Rx */
> > > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > > +
> > > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > +		}
> > > > +		rx_queue->sockfd = qsockfd;
> > > > +
> > > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > > +		tx_queue->framecount = req->tp_frame_nr;
> > > > +
> > > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > > +
> > > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > +		}
> > > > +		tx_queue->sockfd = qsockfd;
> > > > +
> > > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > > +			        name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +
> > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > > +				&fanout_arg, sizeof(fanout_arg));
> > > > +		if (rc == -1) {
> > > > +			RTE_LOG(ERR, PMD,
> > > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > > +				"for %s\n", name, pair->value);
> > > > +			goto error;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	/* reserve an ethdev entry */
> > > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > > +	if (*eth_dev == NULL)
> > > > +		goto error;
> > > > +
> > > > +	/*
> > > > +	 * now put it all together
> > > > +	 * - store queue data in internals,
> > > > +	 * - store numa_node info in pci_driver
> > > > +	 * - point eth_dev_data to internals and pci_driver
> > > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > > +	 */
> > > > +
> > > > +	(*internals)->nb_queues = nb_queues;
> > > > +
> > > > +	data->dev_private = *internals;
> > > > +	data->port_id = (*eth_dev)->data->port_id;
> > > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > > +	data->dev_link = pmd_link;
> > > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > > +
> > > > +	pci_dev->numa_node = numa_node;
> > > > +
> > > > +	(*eth_dev)->data = data;
> > > > +	(*eth_dev)->dev_ops = &ops;
> > > > +	(*eth_dev)->pci_dev = pci_dev;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +error:
> > > > +	if (data)
> > > > +		rte_free(data);
> > > > +	if (pci_dev)
> > > > +		rte_free(pci_dev);
> > > > +	for (q = 0; q < nb_queues; q++) {
> > > > +		if ((*internals)->rx_queue[q].rd)
> > > > +			rte_free((*internals)->rx_queue[q].rd);
> > > > +		if ((*internals)->tx_queue[q].rd)
> > > > +			rte_free((*internals)->tx_queue[q].rd);
> > > > +	}
> > > > +	if (*internals)
> > > > +		rte_free(*internals);
> > > > +	return -1;
> > > > +}
> > > > +
> > > > +static int
> > > > +rte_eth_from_packet(const char *name,
> > > > +                    int const *sockfd,
> > > > +                    const unsigned numa_node,
> > > > +                    struct rte_kvargs *kvlist)
> > > > +{
> > > > +	struct pmd_internals *internals = NULL;
> > > > +	struct rte_eth_dev *eth_dev = NULL;
> > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > +	unsigned k_idx;
> > > > +	unsigned int blockcount;
> > > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > > +	unsigned int qpairs = 1;
> > > > +
> > > > +	/* do some parameter checking */
> > > > +	if (*sockfd < 0)
> > > > +		return -1;
> > > > +
> > > > +	/*
> > > > +	 * Walk arguments for configurable settings
> > > > +	 */
> > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > +		pair = &kvlist->pairs[k_idx];
> > > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > > +			qpairs = atoi(pair->value);
> > > > +			if (qpairs < 1 ||
> > > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > > +				RTE_LOG(ERR, PMD,
> > > > +					"%s: invalid qpairs value\n",
> > > > +				        name);
> > > > +				return -1;
> > > > +			}
> > > > +			continue;
> > > > +		}
> > > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > > +			blocksize = atoi(pair->value);
> > > > +			if (!blocksize) {
> > > > +				RTE_LOG(ERR, PMD,
> > > > +					"%s: invalid blocksize value\n",
> > > > +				        name);
> > > > +				return -1;
> > > > +			}
> > > > +			continue;
> > > > +		}
> > > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > > +			framesize = atoi(pair->value);
> > > > +			if (!framesize) {
> > > > +				RTE_LOG(ERR, PMD,
> > > > +					"%s: invalid framesize value\n",
> > > > +				        name);
> > > > +				return -1;
> > > > +			}
> > > > +			continue;
> > > > +		}
> > > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > > +			framecount = atoi(pair->value);
> > > > +			if (!framecount) {
> > > > +				RTE_LOG(ERR, PMD,
> > > > +					"%s: invalid framecount value\n",
> > > > +				        name);
> > > > +				return -1;
> > > > +			}
> > > > +			continue;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	if (framesize > blocksize) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > > +		        name);
> > > > +		return -1;
> > > > +	}
> > > > +
> > > > +	blockcount = framecount / (blocksize / framesize);
> > > > +	if (!blockcount) {
> > > > +		RTE_LOG(ERR, PMD,
> > > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > > +		return -1;
> > > > +	}
> > > > +
> > > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > > +
> > > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > > +	                           blocksize, blockcount,
> > > > +	                           framesize, framecount,
> > > > +	                           numa_node, &internals, &eth_dev,
> > > > +	                           kvlist) < 0)
> > > > +		return -1;
> > > > +
> > > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +int
> > > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > > +{
> > > > +	unsigned numa_node;
> > > > +	int ret;
> > > > +	struct rte_kvargs *kvlist;
> > > > +	int sockfd = -1;
> > > > +
> > > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > > +
> > > > +	numa_node = rte_socket_id();
> > > > +
> > > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > > +	if (kvlist == NULL)
> > > > +		return -1;
> > > > +
> > > > +	/*
> > > > +	 * If iface argument is passed we open the NICs and use them for
> > > > +	 * reading / writing
> > > > +	 */
> > > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > > +
> > > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > > +		                         &open_packet_iface, &sockfd);
> > > > +		if (ret < 0)
> > > > +			return -1;
> > > > +	}
> > > > +
> > > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > > +	close(sockfd); /* no longer needed */
> > > > +
> > > > +	if (ret < 0)
> > > > +		return -1;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static struct rte_driver pmd_packet_drv = {
> > > > +	.name = "eth_packet",
> > > > +	.type = PMD_VDEV,
> > > > +	.init = rte_pmd_packet_devinit,
> > > > +};
> > > > +
> > > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > new file mode 100644
> > > > index 000000000000..f685611da3e9
> > > > --- /dev/null
> > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > @@ -0,0 +1,55 @@
> > > > +/*-
> > > > + *   BSD LICENSE
> > > > + *
> > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > + *   All rights reserved.
> > > > + *
> > > > + *   Redistribution and use in source and binary forms, with or without
> > > > + *   modification, are permitted provided that the following conditions
> > > > + *   are met:
> > > > + *
> > > > + *     * Redistributions of source code must retain the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer.
> > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer in
> > > > + *       the documentation and/or other materials provided with the
> > > > + *       distribution.
> > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > + *       contributors may be used to endorse or promote products derived
> > > > + *       from this software without specific prior written permission.
> > > > + *
> > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > + */
> > > > +
> > > > +#ifndef _RTE_ETH_PACKET_H_
> > > > +#define _RTE_ETH_PACKET_H_
> > > > +
> > > > +#ifdef __cplusplus
> > > > +extern "C" {
> > > > +#endif
> > > > +
> > > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > > +
> > > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > > +
> > > > +/**
> > > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > > + * configured on command line.
> > > > + */
> > > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > > +
> > > > +#ifdef __cplusplus
> > > > +}
> > > > +#endif
> > > > +
> > > > +#endif
> > > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > > index 34dff2a02a05..a6994c4dbe93 100644
> > > > --- a/mk/rte.app.mk
> > > > +++ b/mk/rte.app.mk
> > > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > > >  endif
> > > >
> > > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > > +LDLIBS += -lrte_pmd_packet
> > > > +endif
> > > > +
> > > >  endif # plugins
> > > >
> > > >  LDLIBS += $(EXECENV_LDLIBS)
> > > > --
> > > > 1.9.3
> > > >
> > > >
> > >
> > > --
> > > John W. Linville		Someday the world will need a hero, and you
> > > linville@tuxdriver.com			might be all we have.  Be ready.
> >
> 
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.
  
Neil Horman Sept. 15, 2014, 3:09 p.m. UTC | #13
On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > -----Original Message-----
> > From: John W. Linville [mailto:linville@tuxdriver.com]
> > Sent: Saturday, September 13, 2014 2:54 AM
> > To: Zhou, Danny
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > 
> > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > I am concerned about the performance impact of too many
> > > > memcpy() calls. Specifically, on the Rx side, the kernel NIC driver
> > > > needs to copy packets to an skb, then af_packet copies them to the
> > > > AF_PACKET buffer, which is mapped to user space, and then they are
> > > > copied to a DPDK mbuf. In addition, 3 copies are needed on the Tx side.
> > > > So to run a simple DPDK L2/L3 forwarding benchmark, each packet needs 6
> > > > copies, which has a significant negative performance impact. We
> > > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > > code changes in the kernel. John R will be presenting it at the coming
> > > > Linux Plumbers Conference. Once the kernel adopts it, the relevant PMD
> > > > will be submitted to dpdk.org.
> > 
> > Admittedly, this is not as good a performer as most of the existing
> > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > previously indicate that it performed better than the pcap-based PMD.
> 
> Yes, slightly higher but makes no big difference.
> 
Do you have numbers for this?  It seems to me faster is faster as long as it's
statistically significant.  Even if it's not, John's AF_PACKET PMD has the ability
to scale to multiple CPUs more easily than the pcap PMD, as it can make use of
the AF_PACKET fanout feature.
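
For readers unfamiliar with fanout, here is a minimal sketch of how a set of
per-queue sockets might join one fanout group, loosely following what the
patch does with PACKET_FANOUT_HASH. The helper names make_fanout_arg() and
join_fanout_group() are made up for illustration and are not part of the
patch or of any library; the packing (group ID in the low 16 bits, mode and
flags in the high 16 bits) is the layout the PACKET_FANOUT socket option
expects.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: pack a 16-bit fanout group ID together with the
 * fanout mode and flags into the 32-bit PACKET_FANOUT option value.
 */
uint32_t
make_fanout_arg(uint16_t group_id, uint16_t type_flags)
{
	return (uint32_t)group_id | ((uint32_t)type_flags << 16);
}

/*
 * Hypothetical helper: add one already-bound AF_PACKET socket to the
 * group. Every socket that joins with the same group ID shares the
 * incoming traffic, with flows spread across sockets by hash.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 */
int
join_fanout_group(int sockfd, uint16_t group_id)
{
	uint32_t arg = make_fanout_arg(group_id,
				       PACKET_FANOUT_HASH |
				       PACKET_FANOUT_FLAG_DEFRAG |
				       PACKET_FANOUT_FLAG_ROLLOVER);

	return setsockopt(sockfd, SOL_PACKET, PACKET_FANOUT,
			  &arg, sizeof(arg));
}
```

With one such socket per queue pair, each polling core would service its own
ring, which is the scaling property referred to above. Opening AF_PACKET
sockets normally requires CAP_NET_RAW, so join_fanout_group() is expected to
fail for an unprivileged process.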

> > I look forward to seeing the changes you mention -- they sound very
> > exciting.  But, they will still require both networking core and
> > driver changes in the kernel.  And as I understand things today,
> > the userland code will still need at least some knowledge of specific
> > devices and how they layout their packet descriptors, etc.  So while
> > those changes sound very promising, they will still have certain
> > drawbacks in common with the current situation.
> 
> Yes, we would like DPDK performance optimization techniques such as huge pages, efficient rx/tx routines that manipulate device-specific
> packet descriptors, and the polling model to remain usable. We have to trade off between performance and commonality. But we believe it will be much easier
> to develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers to DPDK.
> 

Not sure how this relates. What you're describing is the feature Intel has been
working on to augment kernel drivers to provide better throughput via direct
hardware access to user space.  John's PMD provides ubiquitous function on all
hardware. I'm not sure how the desire for one implies the other isn't valuable?

> > It seems like the changes you mention will still need some sort of
> > AF_PACKET-based PMD driver.  Have you implemented that completely
> > separate from the code I already posted?  Or did you add that work
> > on top of mine?
> > 
> 
> For userland code, it certainly uses some of your code related to raw sockets,
> but highly modified. A layer will be added to the eth_dev library to do device
> probing and support new socket options.
> 

Ok, but again, PMDs are independent, and serve different needs.  If their use
is at all overlapping from a functional standpoint, take this one now, and
deprecate it when a better one comes along.  Though from your description it
seems like both have a valid place in the ecosystem.

Neil

> > John
> > 
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > > > Sent: Saturday, September 13, 2014 2:05 AM
> > > > To: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > >
> > > > Ping?  Are there objections to this patch from mid-July?
> > > >
> > > > John
> > > >
> > > > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > > in number -- changing this would be a Simple Matter Of Programming.
> > > > >
> > > > > Interfaces of this type are created with a command line option like
> > > > > "--vdev=eth_packet0,iface=...".  There are a number of options availabe
> > > > > as arguments:
> > > > >
> > > > >  - Interface is chosen by "iface" (required)
> > > > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > > >
> > > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > > ---
> > > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > > with better performance than what PCAP offers in Linux.  This might
> > > > > be useful as a development platform for DPDK applications when
> > > > > DPDK-supported hardware is expensive or unavailable.
> > > > >
> > > > > New in v2:
> > > > >
> > > > > -- fixup some style issues found by check patch
> > > > > -- use if_index as part of fanout group ID
> > > > > -- set default number of queue pairs to 1
> > > > >
> > > > >  config/common_bsdapp                   |   5 +
> > > > >  config/common_linuxapp                 |   5 +
> > > > >  lib/Makefile                           |   1 +
> > > > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > > > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > > >  mk/rte.app.mk                          |   4 +
> > > > >  8 files changed, 957 insertions(+)
> > > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > > > >
> > > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > > index 943dce8f1ede..c317f031278e 100644
> > > > > --- a/config/common_bsdapp
> > > > > +++ b/config/common_bsdapp
> > > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > >
> > > > >  #
> > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > +#
> > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > > +
> > > > > +#
> > > > >  # Do prefetch of packet data within PMD driver receive function
> > > > >  #
> > > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > > --- a/config/common_linuxapp
> > > > > +++ b/config/common_linuxapp
> > > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > >
> > > > >  #
> > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > +#
> > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > > +
> > > > > +#
> > > > >  # Compile Xen PMD
> > > > >  #
> > > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > > index 10c5bb3045bc..930fadf29898 100644
> > > > > --- a/lib/Makefile
> > > > > +++ b/lib/Makefile
> > > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > index 756d6b0c9301..feed24a63272 100644
> > > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > > >
> > > > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > > > new file mode 100644
> > > > > index 000000000000..e1266fb992cd
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > > @@ -0,0 +1,60 @@
> > > > > +#   BSD LICENSE
> > > > > +#
> > > > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > +#   Copyright(c) 2014 6WIND S.A.
> > > > > +#   All rights reserved.
> > > > > +#
> > > > > +#   Redistribution and use in source and binary forms, with or without
> > > > > +#   modification, are permitted provided that the following conditions
> > > > > +#   are met:
> > > > > +#
> > > > > +#     * Redistributions of source code must retain the above copyright
> > > > > +#       notice, this list of conditions and the following disclaimer.
> > > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > > +#       notice, this list of conditions and the following disclaimer in
> > > > > +#       the documentation and/or other materials provided with the
> > > > > +#       distribution.
> > > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > > +#       contributors may be used to endorse or promote products derived
> > > > > +#       from this software without specific prior written permission.
> > > > > +#
> > > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > +
> > > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > > +
> > > > > +#
> > > > > +# library name
> > > > > +#
> > > > > +LIB = librte_pmd_packet.a
> > > > > +
> > > > > +CFLAGS += -O3
> > > > > +CFLAGS += $(WERROR_FLAGS)
> > > > > +
> > > > > +#
> > > > > +# all source are stored in SRCS-y
> > > > > +#
> > > > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > > > +
> > > > > +#
> > > > > +# Export include files
> > > > > +#
> > > > > +SYMLINK-y-include += rte_eth_packet.h
> > > > > +
> > > > > +# this lib depends upon:
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > > > +
> > > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > new file mode 100644
> > > > > index 000000000000..9c82d16e730f
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > @@ -0,0 +1,826 @@
> > > > > +/*-
> > > > > + *   BSD LICENSE
> > > > > + *
> > > > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > > > + *
> > > > > + *   Originally based upon librte_pmd_pcap code:
> > > > > + *
> > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > + *   Copyright(c) 2014 6WIND S.A.
> > > > > + *   All rights reserved.
> > > > > + *
> > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > + *   modification, are permitted provided that the following conditions
> > > > > + *   are met:
> > > > > + *
> > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > + *       the documentation and/or other materials provided with the
> > > > > + *       distribution.
> > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > + *       contributors may be used to endorse or promote products derived
> > > > > + *       from this software without specific prior written permission.
> > > > > + *
> > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > + */
> > > > > +
> > > > > +#include <rte_mbuf.h>
> > > > > +#include <rte_ethdev.h>
> > > > > +#include <rte_malloc.h>
> > > > > +#include <rte_kvargs.h>
> > > > > +#include <rte_dev.h>
> > > > > +
> > > > > +#include <linux/if_ether.h>
> > > > > +#include <linux/if_packet.h>
> > > > > +#include <arpa/inet.h>
> > > > > +#include <net/if.h>
> > > > > +#include <sys/types.h>
> > > > > +#include <sys/socket.h>
> > > > > +#include <sys/ioctl.h>
> > > > > +#include <sys/mman.h>
> > > > > +#include <unistd.h>
> > > > > +#include <poll.h>
> > > > > +
> > > > > +#include "rte_eth_packet.h"
> > > > > +
> > > > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > > > +
> > > > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > > > +
> > > > > +struct pkt_rx_queue {
> > > > > +	int sockfd;
> > > > > +
> > > > > +	struct iovec *rd;
> > > > > +	uint8_t *map;
> > > > > +	unsigned int framecount;
> > > > > +	unsigned int framenum;
> > > > > +
> > > > > +	struct rte_mempool *mb_pool;
> > > > > +
> > > > > +	volatile unsigned long rx_pkts;
> > > > > +	volatile unsigned long err_pkts;
> > > > > +};
> > > > > +
> > > > > +struct pkt_tx_queue {
> > > > > +	int sockfd;
> > > > > +
> > > > > +	struct iovec *rd;
> > > > > +	uint8_t *map;
> > > > > +	unsigned int framecount;
> > > > > +	unsigned int framenum;
> > > > > +
> > > > > +	volatile unsigned long tx_pkts;
> > > > > +	volatile unsigned long err_pkts;
> > > > > +};
> > > > > +
> > > > > +struct pmd_internals {
> > > > > +	unsigned nb_queues;
> > > > > +
> > > > > +	int if_index;
> > > > > +	struct ether_addr eth_addr;
> > > > > +
> > > > > +	struct tpacket_req req;
> > > > > +
> > > > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > +};
> > > > > +
> > > > > +static const char *valid_arguments[] = {
> > > > > +	ETH_PACKET_IFACE_ARG,
> > > > > +	ETH_PACKET_NUM_Q_ARG,
> > > > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > > > +	NULL
> > > > > +};
> > > > > +
> > > > > +static const char *drivername = "AF_PACKET PMD";
> > > > > +
> > > > > +static struct rte_eth_link pmd_link = {
> > > > > +	.link_speed = 10000,
> > > > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > > > +	.link_status = 0
> > > > > +};
> > > > > +
> > > > > +static uint16_t
> > > > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > +{
> > > > > +	unsigned i;
> > > > > +	struct tpacket2_hdr *ppd;
> > > > > +	struct rte_mbuf *mbuf;
> > > > > +	uint8_t *pbuf;
> > > > > +	struct pkt_rx_queue *pkt_q = queue;
> > > > > +	uint16_t num_rx = 0;
> > > > > +	unsigned int framecount, framenum;
> > > > > +
> > > > > +	if (unlikely(nb_pkts == 0))
> > > > > +		return 0;
> > > > > +
> > > > > +	/*
> > > > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > > > +	 */
> > > > > +	framecount = pkt_q->framecount;
> > > > > +	framenum = pkt_q->framenum;
> > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > +		/* point at the next incoming frame */
> > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > > > +			break;
> > > > > +
> > > > > +		/* allocate the next mbuf */
> > > > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > > > +		if (unlikely(mbuf == NULL))
> > > > > +			break;
> > > > > +
> > > > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > > > +
> > > > > +		/* release incoming frame and advance ring buffer */
> > > > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > > > +		if (++framenum >= framecount)
> > > > > +			framenum = 0;
> > > > > +
> > > > > +		/* account for the receive frame */
> > > > > +		bufs[i] = mbuf;
> > > > > +		num_rx++;
> > > > > +	}
> > > > > +	pkt_q->framenum = framenum;
> > > > > +	pkt_q->rx_pkts += num_rx;
> > > > > +	return num_rx;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Callback to handle sending packets through a real NIC.
> > > > > + */
> > > > > +static uint16_t
> > > > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > +{
> > > > > +	struct tpacket2_hdr *ppd;
> > > > > +	struct rte_mbuf *mbuf;
> > > > > +	uint8_t *pbuf;
> > > > > +	unsigned int framecount, framenum;
> > > > > +	struct pollfd pfd;
> > > > > +	struct pkt_tx_queue *pkt_q = queue;
> > > > > +	uint16_t num_tx = 0;
> > > > > +	int i;
> > > > > +
> > > > > +	if (unlikely(nb_pkts == 0))
> > > > > +		return 0;
> > > > > +
> > > > > +	memset(&pfd, 0, sizeof(pfd));
> > > > > +	pfd.fd = pkt_q->sockfd;
> > > > > +	pfd.events = POLLOUT;
> > > > > +	pfd.revents = 0;
> > > > > +
> > > > > +	framecount = pkt_q->framecount;
> > > > > +	framenum = pkt_q->framenum;
> > > > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > +		/* point at the next incoming frame */
> > > > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > > > +		    (poll(&pfd, 1, -1) < 0))
> > > > > +				continue;
> > > > > +
> > > > > +		/* copy the tx frame data */
> > > > > +		mbuf = bufs[num_tx];
> > > > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > > > +			sizeof(struct sockaddr_ll);
> > > > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > > > +
> > > > > +		/* release incoming frame and advance ring buffer */
> > > > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > > > +		if (++framenum >= framecount)
> > > > > +			framenum = 0;
> > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > +
> > > > > +		num_tx++;
> > > > > +		rte_pktmbuf_free(mbuf);
> > > > > +	}
> > > > > +
> > > > > +	/* kick-off transmits */
> > > > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > > > +
> > > > > +	pkt_q->framenum = framenum;
> > > > > +	pkt_q->tx_pkts += num_tx;
> > > > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > > > +	return num_tx;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_dev_start(struct rte_eth_dev *dev)
> > > > > +{
> > > > > +	dev->data->dev_link.link_status = 1;
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * This function gets called when the current port gets stopped.
> > > > > + */
> > > > > +static void
> > > > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > > > +{
> > > > > +	unsigned i;
> > > > > +	int sockfd;
> > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > +
> > > > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > > > +		sockfd = internals->rx_queue[i].sockfd;
> > > > > +		if (sockfd != -1)
> > > > > +			close(sockfd);
> > > > > +		sockfd = internals->tx_queue[i].sockfd;
> > > > > +		if (sockfd != -1)
> > > > > +			close(sockfd);
> > > > > +	}
> > > > > +
> > > > > +	dev->data->dev_link.link_status = 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > > > +{
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > > > +{
> > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > +
> > > > > +	dev_info->driver_name = drivername;
> > > > > +	dev_info->if_index = internals->if_index;
> > > > > +	dev_info->max_mac_addrs = 1;
> > > > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > > > +	dev_info->min_rx_bufsize = 0;
> > > > > +	dev_info->pci_dev = NULL;
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > > > +{
> > > > > +	unsigned i, imax;
> > > > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > > > +
> > > > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > > > +
> > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > +	for (i = 0; i < imax; i++) {
> > > > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > > > +		rx_total += igb_stats->q_ipackets[i];
> > > > > +	}
> > > > > +
> > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > +	for (i = 0; i < imax; i++) {
> > > > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > > > +		tx_total += igb_stats->q_opackets[i];
> > > > > +		tx_err_total += igb_stats->q_errors[i];
> > > > > +	}
> > > > > +
> > > > > +	igb_stats->ipackets = rx_total;
> > > > > +	igb_stats->opackets = tx_total;
> > > > > +	igb_stats->oerrors = tx_err_total;
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > > > +{
> > > > > +	unsigned i;
> > > > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > > > +
> > > > > +	for (i = 0; i < internal->nb_queues; i++)
> > > > > +		internal->rx_queue[i].rx_pkts = 0;
> > > > > +
> > > > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > > > +		internal->tx_queue[i].tx_pkts = 0;
> > > > > +		internal->tx_queue[i].err_pkts = 0;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > > > +{
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +eth_queue_release(void *q __rte_unused)
> > > > > +{
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > > > +                int wait_to_complete __rte_unused)
> > > > > +{
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > > > +                   uint16_t rx_queue_id,
> > > > > +                   uint16_t nb_rx_desc __rte_unused,
> > > > > +                   unsigned int socket_id __rte_unused,
> > > > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > > > +                   struct rte_mempool *mb_pool)
> > > > > +{
> > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > > > +	uint16_t buf_size;
> > > > > +
> > > > > +	pkt_q->mb_pool = mb_pool;
> > > > > +
> > > > > +	/* Now get the space available for data in the mbuf */
> > > > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > > > +	                       RTE_PKTMBUF_HEADROOM);
> > > > > +
> > > > > +	if (ETH_FRAME_LEN > buf_size) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > > > +		return -ENOMEM;
> > > > > +	}
> > > > > +
> > > > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > > > +                   uint16_t tx_queue_id,
> > > > > +                   uint16_t nb_tx_desc __rte_unused,
> > > > > +                   unsigned int socket_id __rte_unused,
> > > > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > > > +{
> > > > > +
> > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > +
> > > > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static struct eth_dev_ops ops = {
> > > > > +	.dev_start = eth_dev_start,
> > > > > +	.dev_stop = eth_dev_stop,
> > > > > +	.dev_close = eth_dev_close,
> > > > > +	.dev_configure = eth_dev_configure,
> > > > > +	.dev_infos_get = eth_dev_info,
> > > > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > > > +	.rx_queue_release = eth_queue_release,
> > > > > +	.tx_queue_release = eth_queue_release,
> > > > > +	.link_update = eth_link_update,
> > > > > +	.stats_get = eth_stats_get,
> > > > > +	.stats_reset = eth_stats_reset,
> > > > > +};
> > > > > +
> > > > > +/*
> > > > > + * Opens an AF_PACKET socket
> > > > > + */
> > > > > +static int
> > > > > +open_packet_iface(const char *key __rte_unused,
> > > > > +                  const char *value __rte_unused,
> > > > > +                  void *extra_args)
> > > > > +{
> > > > > +	int *sockfd = extra_args;
> > > > > +
> > > > > +	/* Open an AF_PACKET socket... */
> > > > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > +	if (*sockfd == -1) {
> > > > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > > > +		return -1;
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +rte_pmd_init_internals(const char *name,
> > > > > +                       const int sockfd,
> > > > > +                       const unsigned nb_queues,
> > > > > +                       unsigned int blocksize,
> > > > > +                       unsigned int blockcnt,
> > > > > +                       unsigned int framesize,
> > > > > +                       unsigned int framecnt,
> > > > > +                       const unsigned numa_node,
> > > > > +                       struct pmd_internals **internals,
> > > > > +                       struct rte_eth_dev **eth_dev,
> > > > > +                       struct rte_kvargs *kvlist)
> > > > > +{
> > > > > +	struct rte_eth_dev_data *data = NULL;
> > > > > +	struct rte_pci_device *pci_dev = NULL;
> > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > +	struct ifreq ifr;
> > > > > +	size_t ifnamelen;
> > > > > +	unsigned k_idx;
> > > > > +	struct sockaddr_ll sockaddr;
> > > > > +	struct tpacket_req *req;
> > > > > +	struct pkt_rx_queue *rx_queue;
> > > > > +	struct pkt_tx_queue *tx_queue;
> > > > > +	int rc, tpver, discard, bypass;
> > > > > +	unsigned int i, q, rdsize;
> > > > > +	int qsockfd, fanout_arg;
> > > > > +
> > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > > > +			break;
> > > > > +	}
> > > > > +	if (pair == NULL) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > > > +		        name);
> > > > > +		goto error;
> > > > > +	}
> > > > > +
> > > > > +	RTE_LOG(INFO, PMD,
> > > > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > > > +		name, numa_node);
> > > > > +
> > > > > +	/*
> > > > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > > > +	 * and internal (private) data
> > > > > +	 */
> > > > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > > > +	if (data == NULL)
> > > > > +		goto error;
> > > > > +
> > > > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > > > +	if (pci_dev == NULL)
> > > > > +		goto error;
> > > > > +
> > > > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > > > +	                                0, numa_node);
> > > > > +	if (*internals == NULL)
> > > > > +		goto error;
> > > > > +
> > > > > +	req = &((*internals)->req);
> > > > > +
> > > > > +	req->tp_block_size = blocksize;
> > > > > +	req->tp_block_nr = blockcnt;
> > > > > +	req->tp_frame_size = framesize;
> > > > > +	req->tp_frame_nr = framecnt;
> > > > > +
> > > > > +	ifnamelen = strlen(pair->value);
> > > > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > > > +	} else {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: I/F name too long (%s)\n",
> > > > > +			name, pair->value);
> > > > > +		goto error;
> > > > > +	}
> > > > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > > > +		        name);
> > > > > +		goto error;
> > > > > +	}
> > > > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > > > +
> > > > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > > > +		        name);
> > > > > +		goto error;
> > > > > +	}
> > > > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > > > +
> > > > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > > > +	sockaddr.sll_family = AF_PACKET;
> > > > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > > > +
> > > > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > > > +
> > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > +		/* Open an AF_PACKET socket for this queue... */
> > > > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > +		if (qsockfd == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +			        "%s: could not open AF_PACKET socket\n",
> > > > > +			        name);
> > > > > +			return -1;
> > > > > +		}
> > > > > +
> > > > > +		tpver = TPACKET_V2;
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > > > +				&tpver, sizeof(tpver));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > > > +				"socket for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		discard = 1;
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > > > +				&discard, sizeof(discard));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_LOSS on "
> > > > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		bypass = 1;
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > > > +				&bypass, sizeof(bypass));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > > > +			        "on AF_PACKET socket for %s\n", name,
> > > > > +			        pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > > > +				"socket for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > > > +				"socket for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > > > +		rx_queue->framecount = req->tp_frame_nr;
> > > > > +
> > > > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > > > +				    qsockfd, 0);
> > > > > +		if (rx_queue->map == MAP_FAILED) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > > > +				name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		/* rdsize is same for both Tx and Rx */
> > > > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > > > +
> > > > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > +		}
> > > > > +		rx_queue->sockfd = qsockfd;
> > > > > +
> > > > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > > > +		tx_queue->framecount = req->tp_frame_nr;
> > > > > +
> > > > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > > > +
> > > > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > +		}
> > > > > +		tx_queue->sockfd = qsockfd;
> > > > > +
> > > > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > > > +			        name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > > > +				&fanout_arg, sizeof(fanout_arg));
> > > > > +		if (rc == -1) {
> > > > > +			RTE_LOG(ERR, PMD,
> > > > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > > > +				"for %s\n", name, pair->value);
> > > > > +			goto error;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	/* reserve an ethdev entry */
> > > > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > > > +	if (*eth_dev == NULL)
> > > > > +		goto error;
> > > > > +
> > > > > +	/*
> > > > > +	 * now put it all together
> > > > > +	 * - store queue data in internals,
> > > > > +	 * - store numa_node info in pci_driver
> > > > > +	 * - point eth_dev_data to internals and pci_driver
> > > > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > > > +	 */
> > > > > +
> > > > > +	(*internals)->nb_queues = nb_queues;
> > > > > +
> > > > > +	data->dev_private = *internals;
> > > > > +	data->port_id = (*eth_dev)->data->port_id;
> > > > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > > > +	data->dev_link = pmd_link;
> > > > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > > > +
> > > > > +	pci_dev->numa_node = numa_node;
> > > > > +
> > > > > +	(*eth_dev)->data = data;
> > > > > +	(*eth_dev)->dev_ops = &ops;
> > > > > +	(*eth_dev)->pci_dev = pci_dev;
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +error:
> > > > > +	if (data)
> > > > > +		rte_free(data);
> > > > > +	if (pci_dev)
> > > > > +		rte_free(pci_dev);
> > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > +		if ((*internals)->rx_queue[q].rd)
> > > > > +			rte_free((*internals)->rx_queue[q].rd);
> > > > > +		if ((*internals)->tx_queue[q].rd)
> > > > > +			rte_free((*internals)->tx_queue[q].rd);
> > > > > +	}
> > > > > +	if (*internals)
> > > > > +		rte_free(*internals);
> > > > > +	return -1;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +rte_eth_from_packet(const char *name,
> > > > > +                    int const *sockfd,
> > > > > +                    const unsigned numa_node,
> > > > > +                    struct rte_kvargs *kvlist)
> > > > > +{
> > > > > +	struct pmd_internals *internals = NULL;
> > > > > +	struct rte_eth_dev *eth_dev = NULL;
> > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > +	unsigned k_idx;
> > > > > +	unsigned int blockcount;
> > > > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > > > +	unsigned int qpairs = 1;
> > > > > +
> > > > > +	/* do some parameter checking */
> > > > > +	if (*sockfd < 0)
> > > > > +		return -1;
> > > > > +
> > > > > +	/*
> > > > > +	 * Walk arguments for configurable settings
> > > > > +	 */
> > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > > > +			qpairs = atoi(pair->value);
> > > > > +			if (qpairs < 1 ||
> > > > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > > > +				RTE_LOG(ERR, PMD,
> > > > > +					"%s: invalid qpairs value\n",
> > > > > +				        name);
> > > > > +				return -1;
> > > > > +			}
> > > > > +			continue;
> > > > > +		}
> > > > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > > > +			blocksize = atoi(pair->value);
> > > > > +			if (!blocksize) {
> > > > > +				RTE_LOG(ERR, PMD,
> > > > > +					"%s: invalid blocksize value\n",
> > > > > +				        name);
> > > > > +				return -1;
> > > > > +			}
> > > > > +			continue;
> > > > > +		}
> > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > > > +			framesize = atoi(pair->value);
> > > > > +			if (!framesize) {
> > > > > +				RTE_LOG(ERR, PMD,
> > > > > +					"%s: invalid framesize value\n",
> > > > > +				        name);
> > > > > +				return -1;
> > > > > +			}
> > > > > +			continue;
> > > > > +		}
> > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > > > +			framecount = atoi(pair->value);
> > > > > +			if (!framecount) {
> > > > > +				RTE_LOG(ERR, PMD,
> > > > > +					"%s: invalid framecount value\n",
> > > > > +				        name);
> > > > > +				return -1;
> > > > > +			}
> > > > > +			continue;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	if (framesize > blocksize) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > > > +		        name);
> > > > > +		return -1;
> > > > > +	}
> > > > > +
> > > > > +	blockcount = framecount / (blocksize / framesize);
> > > > > +	if (!blockcount) {
> > > > > +		RTE_LOG(ERR, PMD,
> > > > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > > > +		return -1;
> > > > > +	}
> > > > > +
> > > > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > > > +
> > > > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > > > +	                           blocksize, blockcount,
> > > > > +	                           framesize, framecount,
> > > > > +	                           numa_node, &internals, &eth_dev,
> > > > > +	                           kvlist) < 0)
> > > > > +		return -1;
> > > > > +
> > > > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +int
> > > > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > > > +{
> > > > > +	unsigned numa_node;
> > > > > +	int ret;
> > > > > +	struct rte_kvargs *kvlist;
> > > > > +	int sockfd = -1;
> > > > > +
> > > > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > > > +
> > > > > +	numa_node = rte_socket_id();
> > > > > +
> > > > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > > > +	if (kvlist == NULL)
> > > > > +		return -1;
> > > > > +
> > > > > +	/*
> > > > > +	 * If iface argument is passed we open the NICs and use them for
> > > > > +	 * reading / writing
> > > > > +	 */
> > > > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > > > +
> > > > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > > > +		                         &open_packet_iface, &sockfd);
> > > > > +		if (ret < 0)
> > > > > +			return -1;
> > > > > +	}
> > > > > +
> > > > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > > > +	close(sockfd); /* no longer needed */
> > > > > +
> > > > > +	if (ret < 0)
> > > > > +		return -1;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static struct rte_driver pmd_packet_drv = {
> > > > > +	.name = "eth_packet",
> > > > > +	.type = PMD_VDEV,
> > > > > +	.init = rte_pmd_packet_devinit,
> > > > > +};
> > > > > +
> > > > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > new file mode 100644
> > > > > index 000000000000..f685611da3e9
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > @@ -0,0 +1,55 @@
> > > > > +/*-
> > > > > + *   BSD LICENSE
> > > > > + *
> > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > + *   All rights reserved.
> > > > > + *
> > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > + *   modification, are permitted provided that the following conditions
> > > > > + *   are met:
> > > > > + *
> > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > + *       the documentation and/or other materials provided with the
> > > > > + *       distribution.
> > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > + *       contributors may be used to endorse or promote products derived
> > > > > + *       from this software without specific prior written permission.
> > > > > + *
> > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > + */
> > > > > +
> > > > > +#ifndef _RTE_ETH_PACKET_H_
> > > > > +#define _RTE_ETH_PACKET_H_
> > > > > +
> > > > > +#ifdef __cplusplus
> > > > > +extern "C" {
> > > > > +#endif
> > > > > +
> > > > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > > > +
> > > > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > > > +
> > > > > +/**
> > > > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > > > + * configured on command line.
> > > > > + */
> > > > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > > > +
> > > > > +#ifdef __cplusplus
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > > +#endif
> > > > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > > > index 34dff2a02a05..a6994c4dbe93 100644
> > > > > --- a/mk/rte.app.mk
> > > > > +++ b/mk/rte.app.mk
> > > > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > > > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > > > >  endif
> > > > >
> > > > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > > > +LDLIBS += -lrte_pmd_packet
> > > > > +endif
> > > > > +
> > > > >  endif # plugins
> > > > >
> > > > >  LDLIBS += $(EXECENV_LDLIBS)
> > > > > --
> > > > > 1.9.3
> > > > >
> > > > >
> > > >
> > > > --
> > > > John W. Linville		Someday the world will need a hero, and you
> > > > linville@tuxdriver.com			might be all we have.  Be ready.
> > >
> > 
> > --
> > John W. Linville		Someday the world will need a hero, and you
> > linville@tuxdriver.com			might be all we have.  Be ready.
>
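
The ring sizing in the quoted rte_eth_from_packet() and the mmap() call in the quoted init code can be sketched as below. This is a minimal sketch, not code from the patch: struct ring_geometry and ring_geometry_init() are hypothetical names. The arithmetic mirrors the quoted "blockcount = framecount / (blocksize / framesize)" check, the "2 * tp_block_size * tp_block_nr" mapping length, and the Tx ring starting at "tp_block_size * tp_block_nr".

```c
#include <stddef.h>

/* Hypothetical helper type for illustration; not part of the patch. */
struct ring_geometry {
	unsigned int block_size;
	unsigned int block_nr;
	unsigned int frame_size;
	unsigned int frame_nr;
	size_t tx_offset;	/* start of the Tx ring inside the mapping */
	size_t map_len;		/* bytes to mmap: Rx ring followed by Tx ring */
};

/* Returns 0 on success, -1 if the parameters cannot describe a ring. */
static int
ring_geometry_init(struct ring_geometry *g, unsigned int blocksize,
		   unsigned int framesize, unsigned int framecount)
{
	unsigned int frames_per_block;

	/* same rejections as the quoted parameter checks */
	if (blocksize == 0 || framesize == 0 || framecount == 0)
		return -1;
	if (framesize > blocksize)
		return -1;

	frames_per_block = blocksize / framesize;
	g->block_nr = framecount / frames_per_block;
	if (g->block_nr == 0)
		return -1;

	g->block_size = blocksize;
	g->frame_size = framesize;
	g->frame_nr = framecount;

	/* one mapping holds both rings, Rx first and Tx right behind it */
	g->tx_offset = (size_t)blocksize * g->block_nr;
	g->map_len = 2 * g->tx_offset;
	return 0;
}
```

With the defaults (blocksz=4096, framesz=2048, framecnt=512) this gives 2 frames per block, 256 blocks, and a 2 MiB mapping per queue pair with the Tx ring starting at the 1 MiB mark.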
  
John W. Linville Sept. 15, 2014, 3:15 p.m. UTC | #14
On Mon, Sep 15, 2014 at 11:09:46AM -0400, Neil Horman wrote:
> On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > -----Original Message-----
> > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > Sent: Saturday, September 13, 2014 2:54 AM
> > > To: Zhou, Danny
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > 
> > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > I am concerned about the performance cost of its many memcpy()
> > > > calls. Specifically, on the Rx side, the kernel NIC driver copies
> > > > packets into an skb, then af_packet copies them into the AF_PACKET
> > > > buffer that is mapped to user space, and then they are copied again
> > > > into a DPDK mbuf. Likewise, 3 copies are needed on the Tx side. So in
> > > > a simple DPDK L2/L3 forwarding benchmark, each packet is copied 6
> > > > times, which has a significant negative performance impact. We
> > > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > > code changes in the kernel. John R will be presenting it at the coming
> > > > Linux Plumbers Conference. Once the kernel adopts it, the relevant PMD
> > > > will be submitted to dpdk.org.
> > > 
> > > Admittedly, this is not as good a performer as most of the existing
> > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > > previously indicate that it performed better than the pcap-based PMD.
> > 
> > Yes, slightly higher but makes no big difference.
> > 
> Do you have numbers for this?  It seems to me faster is faster as long as it's
> statistically significant.  Even if it's not, John's AF_PACKET PMD has the ability
> to scale to multiple CPUs more easily than the pcap PMD, as it can make use of
> the AF_PACKET fanout feature.
> 
> > > I look forward to seeing the changes you mention -- they sound very
> > > exciting.  But, they will still require both networking core and
> > > driver changes in the kernel.  And as I understand things today,
> > > the userland code will still need at least some knowledge of specific
> > > devices and how they layout their packet descriptors, etc.  So while
> > > those changes sound very promising, they will still have certain
> > > drawbacks in common with the current situation.
> > 
> > Yes, we would like DPDK performance optimization techniques such as huge pages, efficient rx/tx routines that manipulate device-specific
> > packet descriptors, and the polling model to remain usable. We have to trade off performance against commonality. But we believe it will be much easier
> > to develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers to DPDK.
> > 
> 
> Not sure how this relates; what you're describing is the feature Intel has been
> working on to augment kernel drivers to provide better throughput via direct
> hardware access to user space.  John's PMD provides ubiquitous function on all
> hardware. I'm not sure how the desire for one implies the other isn't valuable?
> 
> > > It seems like the changes you mention will still need some sort of
> > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > separate from the code I already posted?  Or did you add that work
> > > on top of mine?
> > > 
> > 
> > For userland code, it certainly uses some of your code related to raw sockets, but highly modified. A layer will be added into the eth_dev library to do device
> > probe and support new socket options.
> > 
> 
> Ok, but again, PMDs are independent, and serve different needs.  If their use
> is at all overlapping from a functional standpoint, take this one now, and
> deprecate it when a better one comes along.  Though from your description it
> seems like both have a valid place in the ecosystem.

That's where I'm at as well -- I don't see anything in the above that
amounts to an argument against the AF_PACKET-based PMD I have posted.
"Wait for ours" doesn't hold much water, especially when we are trying
to address different problems.

John
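
On the fanout feature mentioned above: the quoted patch passes a single integer, fanout_arg, to setsockopt(PACKET_FANOUT). A minimal sketch of how such an argument is laid out, following packet(7): group ID in the low 16 bits, fanout mode and flags in the high 16 bits. The SKETCH_* constants repeat values from <linux/if_packet.h> so the fragment builds anywhere. fanout_arg_encode() and fanout_group_id() are hypothetical helpers, and the way the group ID is derived from if_index here is an assumption, since that part of the patch is not quoted in this message.

```c
#include <stdint.h>

/* Values as defined in <linux/if_packet.h>, repeated for portability. */
#define SKETCH_FANOUT_HASH		0
#define SKETCH_FANOUT_FLAG_ROLLOVER	0x1000
#define SKETCH_FANOUT_FLAG_DEFRAG	0x8000

/* Pack group ID (low 16 bits) and mode/flags (high 16 bits). */
static uint32_t
fanout_arg_encode(uint16_t group_id, uint16_t mode_and_flags)
{
	return (uint32_t)group_id | ((uint32_t)mode_and_flags << 16);
}

/*
 * Assumed derivation: fold if_index and a caller-supplied salt (for
 * example a process ID) into 16 bits, so that all queue sockets of one
 * device share a group while different devices usually do not.
 */
static uint16_t
fanout_group_id(unsigned int if_index, unsigned int salt)
{
	return (uint16_t)((if_index ^ salt) & 0xffff);
}
```

Every queue socket of one device would pass the same encoded value, which is what lets the kernel hash flows across the sockets in the group.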

> 
> Neil
> 
> > > John
> > > 
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > > > > Sent: Saturday, September 13, 2014 2:05 AM
> > > > > To: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > > >
> > > > > Ping?  Are there objections to this patch from mid-July?
> > > > >
> > > > > John
> > > > >
> > > > > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > > > in number -- changing this would be a Simple Matter Of Programming.
> > > > > >
> > > > > > Interfaces of this type are created with a command line option like
> > > > > > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > > > > > as arguments:
> > > > > >
> > > > > >  - Interface is chosen by "iface" (required)
> > > > > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > > > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > > > >
> > > > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > > > ---
> > > > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > > > with better performance than what PCAP offers in Linux.  This might
> > > > > > be useful as a development platform for DPDK applications when
> > > > > > DPDK-supported hardware is expensive or unavailable.
> > > > > >
> > > > > > New in v2:
> > > > > >
> > > > > > -- fixup some style issues found by checkpatch
> > > > > > -- use if_index as part of fanout group ID
> > > > > > -- set default number of queue pairs to 1
> > > > > >
> > > > > >  config/common_bsdapp                   |   5 +
> > > > > >  config/common_linuxapp                 |   5 +
> > > > > >  lib/Makefile                           |   1 +
> > > > > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > > > > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > > > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > > > >  mk/rte.app.mk                          |   4 +
> > > > > >  8 files changed, 957 insertions(+)
> > > > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > > > > >
> > > > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > > > index 943dce8f1ede..c317f031278e 100644
> > > > > > --- a/config/common_bsdapp
> > > > > > +++ b/config/common_bsdapp
> > > > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > > >
> > > > > >  #
> > > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > > +#
> > > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > > > +
> > > > > > +#
> > > > > >  # Do prefetch of packet data within PMD driver receive function
> > > > > >  #
> > > > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > > > --- a/config/common_linuxapp
> > > > > > +++ b/config/common_linuxapp
> > > > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > > >
> > > > > >  #
> > > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > > +#
> > > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > > > +
> > > > > > +#
> > > > > >  # Compile Xen PMD
> > > > > >  #
> > > > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > > > index 10c5bb3045bc..930fadf29898 100644
> > > > > > --- a/lib/Makefile
> > > > > > +++ b/lib/Makefile
> > > > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > index 756d6b0c9301..feed24a63272 100644
> > > > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > > > >
> > > > > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > > > > new file mode 100644
> > > > > > index 000000000000..e1266fb992cd
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > > > @@ -0,0 +1,60 @@
> > > > > > +#   BSD LICENSE
> > > > > > +#
> > > > > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > +#   Copyright(c) 2014 6WIND S.A.
> > > > > > +#   All rights reserved.
> > > > > > +#
> > > > > > +#   Redistribution and use in source and binary forms, with or without
> > > > > > +#   modification, are permitted provided that the following conditions
> > > > > > +#   are met:
> > > > > > +#
> > > > > > +#     * Redistributions of source code must retain the above copyright
> > > > > > +#       notice, this list of conditions and the following disclaimer.
> > > > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > > > +#       notice, this list of conditions and the following disclaimer in
> > > > > > +#       the documentation and/or other materials provided with the
> > > > > > +#       distribution.
> > > > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > > > +#       contributors may be used to endorse or promote products derived
> > > > > > +#       from this software without specific prior written permission.
> > > > > > +#
> > > > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > +
> > > > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > > > +
> > > > > > +#
> > > > > > +# library name
> > > > > > +#
> > > > > > +LIB = librte_pmd_packet.a
> > > > > > +
> > > > > > +CFLAGS += -O3
> > > > > > +CFLAGS += $(WERROR_FLAGS)
> > > > > > +
> > > > > > +#
> > > > > > +# all source are stored in SRCS-y
> > > > > > +#
> > > > > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > > > > +
> > > > > > +#
> > > > > > +# Export include files
> > > > > > +#
> > > > > > +SYMLINK-y-include += rte_eth_packet.h
> > > > > > +
> > > > > > +# this lib depends upon:
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > > > > +
> > > > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..9c82d16e730f
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > > @@ -0,0 +1,826 @@
> > > > > > +/*-
> > > > > > + *   BSD LICENSE
> > > > > > + *
> > > > > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > > > > + *
> > > > > > + *   Originally based upon librte_pmd_pcap code:
> > > > > > + *
> > > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > + *   Copyright(c) 2014 6WIND S.A.
> > > > > > + *   All rights reserved.
> > > > > > + *
> > > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > > + *   modification, are permitted provided that the following conditions
> > > > > > + *   are met:
> > > > > > + *
> > > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > > + *       the documentation and/or other materials provided with the
> > > > > > + *       distribution.
> > > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > > + *       contributors may be used to endorse or promote products derived
> > > > > > + *       from this software without specific prior written permission.
> > > > > > + *
> > > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > + */
> > > > > > +
> > > > > > +#include <rte_mbuf.h>
> > > > > > +#include <rte_ethdev.h>
> > > > > > +#include <rte_malloc.h>
> > > > > > +#include <rte_kvargs.h>
> > > > > > +#include <rte_dev.h>
> > > > > > +
> > > > > > +#include <linux/if_ether.h>
> > > > > > +#include <linux/if_packet.h>
> > > > > > +#include <arpa/inet.h>
> > > > > > +#include <net/if.h>
> > > > > > +#include <sys/types.h>
> > > > > > +#include <sys/socket.h>
> > > > > > +#include <sys/ioctl.h>
> > > > > > +#include <sys/mman.h>
> > > > > > +#include <unistd.h>
> > > > > > +#include <poll.h>
> > > > > > +
> > > > > > +#include "rte_eth_packet.h"
> > > > > > +
> > > > > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > > > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > > > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > > > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > > > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > > > > +
> > > > > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > > > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > > > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > > > > +
> > > > > > +struct pkt_rx_queue {
> > > > > > +	int sockfd;
> > > > > > +
> > > > > > +	struct iovec *rd;
> > > > > > +	uint8_t *map;
> > > > > > +	unsigned int framecount;
> > > > > > +	unsigned int framenum;
> > > > > > +
> > > > > > +	struct rte_mempool *mb_pool;
> > > > > > +
> > > > > > +	volatile unsigned long rx_pkts;
> > > > > > +	volatile unsigned long err_pkts;
> > > > > > +};
> > > > > > +
> > > > > > +struct pkt_tx_queue {
> > > > > > +	int sockfd;
> > > > > > +
> > > > > > +	struct iovec *rd;
> > > > > > +	uint8_t *map;
> > > > > > +	unsigned int framecount;
> > > > > > +	unsigned int framenum;
> > > > > > +
> > > > > > +	volatile unsigned long tx_pkts;
> > > > > > +	volatile unsigned long err_pkts;
> > > > > > +};
> > > > > > +
> > > > > > +struct pmd_internals {
> > > > > > +	unsigned nb_queues;
> > > > > > +
> > > > > > +	int if_index;
> > > > > > +	struct ether_addr eth_addr;
> > > > > > +
> > > > > > +	struct tpacket_req req;
> > > > > > +
> > > > > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > > +};
> > > > > > +
> > > > > > +static const char *valid_arguments[] = {
> > > > > > +	ETH_PACKET_IFACE_ARG,
> > > > > > +	ETH_PACKET_NUM_Q_ARG,
> > > > > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > > > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > > > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > > > > +	NULL
> > > > > > +};
> > > > > > +
> > > > > > +static const char *drivername = "AF_PACKET PMD";
> > > > > > +
> > > > > > +static struct rte_eth_link pmd_link = {
> > > > > > +	.link_speed = 10000,
> > > > > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > > > > +	.link_status = 0
> > > > > > +};
> > > > > > +
> > > > > > +static uint16_t
> > > > > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	struct tpacket2_hdr *ppd;
> > > > > > +	struct rte_mbuf *mbuf;
> > > > > > +	uint8_t *pbuf;
> > > > > > +	struct pkt_rx_queue *pkt_q = queue;
> > > > > > +	uint16_t num_rx = 0;
> > > > > > +	unsigned int framecount, framenum;
> > > > > > +
> > > > > > +	if (unlikely(nb_pkts == 0))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > > > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > > > > +	 */
> > > > > > +	framecount = pkt_q->framecount;
> > > > > > +	framenum = pkt_q->framenum;
> > > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > > +		/* point at the next incoming frame */
> > > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > > > > +			break;
> > > > > > +
> > > > > > +		/* allocate the next mbuf */
> > > > > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > > > > +		if (unlikely(mbuf == NULL))
> > > > > > +			break;
> > > > > > +
> > > > > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > > > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > > > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > > > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > > > > +
> > > > > > +		/* release incoming frame and advance ring buffer */
> > > > > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > > > > +		if (++framenum >= framecount)
> > > > > > +			framenum = 0;
> > > > > > +
> > > > > > +		/* account for the receive frame */
> > > > > > +		bufs[i] = mbuf;
> > > > > > +		num_rx++;
> > > > > > +	}
> > > > > > +	pkt_q->framenum = framenum;
> > > > > > +	pkt_q->rx_pkts += num_rx;
> > > > > > +	return num_rx;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Callback to handle sending packets through a real NIC.
> > > > > > + */
> > > > > > +static uint16_t
> > > > > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > > +{
> > > > > > +	struct tpacket2_hdr *ppd;
> > > > > > +	struct rte_mbuf *mbuf;
> > > > > > +	uint8_t *pbuf;
> > > > > > +	unsigned int framecount, framenum;
> > > > > > +	struct pollfd pfd;
> > > > > > +	struct pkt_tx_queue *pkt_q = queue;
> > > > > > +	uint16_t num_tx = 0;
> > > > > > +	int i;
> > > > > > +
> > > > > > +	if (unlikely(nb_pkts == 0))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	memset(&pfd, 0, sizeof(pfd));
> > > > > > +	pfd.fd = pkt_q->sockfd;
> > > > > > +	pfd.events = POLLOUT;
> > > > > > +	pfd.revents = 0;
> > > > > > +
> > > > > > +	framecount = pkt_q->framecount;
> > > > > > +	framenum = pkt_q->framenum;
> > > > > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > > +		/* point at the next incoming frame */
> > > > > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > > > > +		    (poll(&pfd, 1, -1) < 0))
> > > > > > +				continue;
> > > > > > +
> > > > > > +		/* copy the tx frame data */
> > > > > > +		mbuf = bufs[num_tx];
> > > > > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > > > > +			sizeof(struct sockaddr_ll);
> > > > > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > > > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > > > > +
> > > > > > +		/* release incoming frame and advance ring buffer */
> > > > > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > > > > +		if (++framenum >= framecount)
> > > > > > +			framenum = 0;
> > > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +
> > > > > > +		num_tx++;
> > > > > > +		rte_pktmbuf_free(mbuf);
> > > > > > +	}
> > > > > > +
> > > > > > +	/* kick-off transmits */
> > > > > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > > > > +
> > > > > > +	pkt_q->framenum = framenum;
> > > > > > +	pkt_q->tx_pkts += num_tx;
> > > > > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > > > > +	return num_tx;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_dev_start(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	dev->data->dev_link.link_status = 1;
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * This function gets called when the current port gets stopped.
> > > > > > + */
> > > > > > +static void
> > > > > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	int sockfd;
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > > > > +		sockfd = internals->rx_queue[i].sockfd;
> > > > > > +		if (sockfd != -1)
> > > > > > +			close(sockfd);
> > > > > > +		sockfd = internals->tx_queue[i].sockfd;
> > > > > > +		if (sockfd != -1)
> > > > > > +			close(sockfd);
> > > > > > +	}
> > > > > > +
> > > > > > +	dev->data->dev_link.link_status = 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > > > > +{
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	dev_info->driver_name = drivername;
> > > > > > +	dev_info->if_index = internals->if_index;
> > > > > > +	dev_info->max_mac_addrs = 1;
> > > > > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > > > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > > > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > > > > +	dev_info->min_rx_bufsize = 0;
> > > > > > +	dev_info->pci_dev = NULL;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > > > > +{
> > > > > > +	unsigned i, imax;
> > > > > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > > > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > > > > +
> > > > > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > > > > +
> > > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > > +	for (i = 0; i < imax; i++) {
> > > > > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > > > > +		rx_total += igb_stats->q_ipackets[i];
> > > > > > +	}
> > > > > > +
> > > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > > +	for (i = 0; i < imax; i++) {
> > > > > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > > > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > > > > +		tx_total += igb_stats->q_opackets[i];
> > > > > > +		tx_err_total += igb_stats->q_errors[i];
> > > > > > +	}
> > > > > > +
> > > > > > +	igb_stats->ipackets = rx_total;
> > > > > > +	igb_stats->opackets = tx_total;
> > > > > > +	igb_stats->oerrors = tx_err_total;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > > > > +
> > > > > > +	for (i = 0; i < internal->nb_queues; i++)
> > > > > > +		internal->rx_queue[i].rx_pkts = 0;
> > > > > > +
> > > > > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > > > > +		internal->tx_queue[i].tx_pkts = 0;
> > > > > > +		internal->tx_queue[i].err_pkts = 0;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > > > > +{
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_queue_release(void *q __rte_unused)
> > > > > > +{
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > > > > +                int wait_to_complete __rte_unused)
> > > > > > +{
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > > > > +                   uint16_t rx_queue_id,
> > > > > > +                   uint16_t nb_rx_desc __rte_unused,
> > > > > > +                   unsigned int socket_id __rte_unused,
> > > > > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > > > > +                   struct rte_mempool *mb_pool)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > > > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > > > > +	uint16_t buf_size;
> > > > > > +
> > > > > > +	pkt_q->mb_pool = mb_pool;
> > > > > > +
> > > > > > +	/* Now get the space available for data in the mbuf */
> > > > > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > > > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > > > > +	                       RTE_PKTMBUF_HEADROOM);
> > > > > > +
> > > > > > +	if (ETH_FRAME_LEN > buf_size) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > > > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > > > > +		return -ENOMEM;
> > > > > > +	}
> > > > > > +
> > > > > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > > > > +                   uint16_t tx_queue_id,
> > > > > > +                   uint16_t nb_tx_desc __rte_unused,
> > > > > > +                   unsigned int socket_id __rte_unused,
> > > > > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > > > > +{
> > > > > > +
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static struct eth_dev_ops ops = {
> > > > > > +	.dev_start = eth_dev_start,
> > > > > > +	.dev_stop = eth_dev_stop,
> > > > > > +	.dev_close = eth_dev_close,
> > > > > > +	.dev_configure = eth_dev_configure,
> > > > > > +	.dev_infos_get = eth_dev_info,
> > > > > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > > > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > > > > +	.rx_queue_release = eth_queue_release,
> > > > > > +	.tx_queue_release = eth_queue_release,
> > > > > > +	.link_update = eth_link_update,
> > > > > > +	.stats_get = eth_stats_get,
> > > > > > +	.stats_reset = eth_stats_reset,
> > > > > > +};
> > > > > > +
> > > > > > +/*
> > > > > > + * Opens an AF_PACKET socket
> > > > > > + */
> > > > > > +static int
> > > > > > +open_packet_iface(const char *key __rte_unused,
> > > > > > +                  const char *value __rte_unused,
> > > > > > +                  void *extra_args)
> > > > > > +{
> > > > > > +	int *sockfd = extra_args;
> > > > > > +
> > > > > > +	/* Open an AF_PACKET socket... */
> > > > > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > > +	if (*sockfd == -1) {
> > > > > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +rte_pmd_init_internals(const char *name,
> > > > > > +                       const int sockfd,
> > > > > > +                       const unsigned nb_queues,
> > > > > > +                       unsigned int blocksize,
> > > > > > +                       unsigned int blockcnt,
> > > > > > +                       unsigned int framesize,
> > > > > > +                       unsigned int framecnt,
> > > > > > +                       const unsigned numa_node,
> > > > > > +                       struct pmd_internals **internals,
> > > > > > +                       struct rte_eth_dev **eth_dev,
> > > > > > +                       struct rte_kvargs *kvlist)
> > > > > > +{
> > > > > > +	struct rte_eth_dev_data *data = NULL;
> > > > > > +	struct rte_pci_device *pci_dev = NULL;
> > > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > > +	struct ifreq ifr;
> > > > > > +	size_t ifnamelen;
> > > > > > +	unsigned k_idx;
> > > > > > +	struct sockaddr_ll sockaddr;
> > > > > > +	struct tpacket_req *req;
> > > > > > +	struct pkt_rx_queue *rx_queue;
> > > > > > +	struct pkt_tx_queue *tx_queue;
> > > > > > +	int rc, tpver, discard, bypass;
> > > > > > +	unsigned int i, q, rdsize;
> > > > > > +	int qsockfd, fanout_arg;
> > > > > > +
> > > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > > > > +			break;
> > > > > > +	}
> > > > > > +	if (pair == NULL) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD,
> > > > > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > > > > +		name, numa_node);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > > > > +	 * and internal (private) data
> > > > > > +	 */
> > > > > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > > > > +	if (data == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > > > > +	if (pci_dev == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > > > > +	                                0, numa_node);
> > > > > > +	if (*internals == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	req = &((*internals)->req);
> > > > > > +
> > > > > > +	req->tp_block_size = blocksize;
> > > > > > +	req->tp_block_nr = blockcnt;
> > > > > > +	req->tp_frame_size = framesize;
> > > > > > +	req->tp_frame_nr = framecnt;
> > > > > > +
> > > > > > +	ifnamelen = strlen(pair->value);
> > > > > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > > > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > > > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > > > > +	} else {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: I/F name too long (%s)\n",
> > > > > > +			name, pair->value);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > > > > +
> > > > > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > > > > +
> > > > > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > > > > +	sockaddr.sll_family = AF_PACKET;
> > > > > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > > > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > > > > +
> > > > > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > > > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > > > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > > > > +
> > > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > > +		/* Open an AF_PACKET socket for this queue... */
> > > > > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > > +		if (qsockfd == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +			        "%s: could not open AF_PACKET socket\n",
> > > > > > +			        name);
> > > > > > +			return -1;
> > > > > > +		}
> > > > > > +
> > > > > > +		tpver = TPACKET_V2;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > > > > +				&tpver, sizeof(tpver));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		discard = 1;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > > > > +				&discard, sizeof(discard));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_LOSS on "
> > > > > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		bypass = 1;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > > > > +				&bypass, sizeof(bypass));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > > > > +			        "on AF_PACKET socket for %s\n", name,
> > > > > > +			        pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > > > > +		rx_queue->framecount = req->tp_frame_nr;
> > > > > > +
> > > > > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > > > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > > > > +				    qsockfd, 0);
> > > > > > +		if (rx_queue->map == MAP_FAILED) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > > > > +				name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		/* rdsize is same for both Tx and Rx */
> > > > > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > > > > +
> > > > > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > > > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > > +		}
> > > > > > +		rx_queue->sockfd = qsockfd;
> > > > > > +
> > > > > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > > > > +		tx_queue->framecount = req->tp_frame_nr;
> > > > > > +
> > > > > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > > > > +
> > > > > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > > > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > > +		}
> > > > > > +		tx_queue->sockfd = qsockfd;
> > > > > > +
> > > > > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > > > > +			        name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > > > > +				&fanout_arg, sizeof(fanout_arg));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > > > > +				"for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	/* reserve an ethdev entry */
> > > > > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > > > > +	if (*eth_dev == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * now put it all together
> > > > > > +	 * - store queue data in internals,
> > > > > > +	 * - store numa_node info in pci_driver
> > > > > > +	 * - point eth_dev_data to internals and pci_driver
> > > > > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > > > > +	 */
> > > > > > +
> > > > > > +	(*internals)->nb_queues = nb_queues;
> > > > > > +
> > > > > > +	data->dev_private = *internals;
> > > > > > +	data->port_id = (*eth_dev)->data->port_id;
> > > > > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > > > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > > > > +	data->dev_link = pmd_link;
> > > > > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > > > > +
> > > > > > +	pci_dev->numa_node = numa_node;
> > > > > > +
> > > > > > +	(*eth_dev)->data = data;
> > > > > > +	(*eth_dev)->dev_ops = &ops;
> > > > > > +	(*eth_dev)->pci_dev = pci_dev;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +error:
> > > > > > +	if (data)
> > > > > > +		rte_free(data);
> > > > > > +	if (pci_dev)
> > > > > > +		rte_free(pci_dev);
> > > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > > +		if ((*internals)->rx_queue[q].rd)
> > > > > > +			rte_free((*internals)->rx_queue[q].rd);
> > > > > > +		if ((*internals)->tx_queue[q].rd)
> > > > > > +			rte_free((*internals)->tx_queue[q].rd);
> > > > > > +	}
> > > > > > +	if (*internals)
> > > > > > +		rte_free(*internals);
> > > > > > +	return -1;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +rte_eth_from_packet(const char *name,
> > > > > > +                    int const *sockfd,
> > > > > > +                    const unsigned numa_node,
> > > > > > +                    struct rte_kvargs *kvlist)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = NULL;
> > > > > > +	struct rte_eth_dev *eth_dev = NULL;
> > > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > > +	unsigned k_idx;
> > > > > > +	unsigned int blockcount;
> > > > > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > > > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > > > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > > > > +	unsigned int qpairs = 1;
> > > > > > +
> > > > > > +	/* do some parameter checking */
> > > > > > +	if (*sockfd < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Walk arguments for configurable settings
> > > > > > +	 */
> > > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > > > > +			qpairs = atoi(pair->value);
> > > > > > +			if (qpairs < 1 ||
> > > > > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid qpairs value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > > > > +			blocksize = atoi(pair->value);
> > > > > > +			if (!blocksize) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid blocksize value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > > > > +			framesize = atoi(pair->value);
> > > > > > +			if (!framesize) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid framesize value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > > > > +			framecount = atoi(pair->value);
> > > > > > +			if (!framecount) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid framecount value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	if (framesize > blocksize) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > > > > +		        name);
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	blockcount = framecount / (blocksize / framesize);
> > > > > > +	if (!blockcount) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > > > > +
> > > > > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > > > > +	                           blocksize, blockcount,
> > > > > > +	                           framesize, framecount,
> > > > > > +	                           numa_node, &internals, &eth_dev,
> > > > > > +	                           kvlist) < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > > > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +int
> > > > > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > > > > +{
> > > > > > +	unsigned numa_node;
> > > > > > +	int ret;
> > > > > > +	struct rte_kvargs *kvlist;
> > > > > > +	int sockfd = -1;
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > > > > +
> > > > > > +	numa_node = rte_socket_id();
> > > > > > +
> > > > > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > > > > +	if (kvlist == NULL)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * If iface argument is passed we open the NICs and use them for
> > > > > > +	 * reading / writing
> > > > > > +	 */
> > > > > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > > > > +
> > > > > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > > > > +		                         &open_packet_iface, &sockfd);
> > > > > > +		if (ret < 0)
> > > > > > +			return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > > > > +	close(sockfd); /* no longer needed */
> > > > > > +
> > > > > > +	if (ret < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static struct rte_driver pmd_packet_drv = {
> > > > > > +	.name = "eth_packet",
> > > > > > +	.type = PMD_VDEV,
> > > > > > +	.init = rte_pmd_packet_devinit,
> > > > > > +};
> > > > > > +
> > > > > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..f685611da3e9
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > > @@ -0,0 +1,55 @@
> > > > > > +/*-
> > > > > > + *   BSD LICENSE
> > > > > > + *
> > > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > + *   All rights reserved.
> > > > > > + *
> > > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > > + *   modification, are permitted provided that the following conditions
> > > > > > + *   are met:
> > > > > > + *
> > > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > > + *       the documentation and/or other materials provided with the
> > > > > > + *       distribution.
> > > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > > + *       contributors may be used to endorse or promote products derived
> > > > > > + *       from this software without specific prior written permission.
> > > > > > + *
> > > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef _RTE_ETH_PACKET_H_
> > > > > > +#define _RTE_ETH_PACKET_H_
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +extern "C" {
> > > > > > +#endif
> > > > > > +
> > > > > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > > > > +
> > > > > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > > > > +
> > > > > > +/**
> > > > > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > > > > + * configured on command line.
> > > > > > + */
> > > > > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +}
> > > > > > +#endif
> > > > > > +
> > > > > > +#endif
> > > > > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > > > > index 34dff2a02a05..a6994c4dbe93 100644
> > > > > > --- a/mk/rte.app.mk
> > > > > > +++ b/mk/rte.app.mk
> > > > > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > > > > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > > > > >  endif
> > > > > >
> > > > > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > > > > +LDLIBS += -lrte_pmd_packet
> > > > > > +endif
> > > > > > +
> > > > > >  endif # plugins
> > > > > >
> > > > > >  LDLIBS += $(EXECENV_LDLIBS)
> > > > > > --
> > > > > > 1.9.3
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > John W. Linville		Someday the world will need a hero, and you
> > > > > linville@tuxdriver.com			might be all we have.  Be ready.
> > > >
> > > 
> > > --
> > > John W. Linville		Someday the world will need a hero, and you
> > > linville@tuxdriver.com			might be all we have.  Be ready.
> > 
>
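
For readers following the ring setup in the patch quoted above, the geometry that rte_eth_from_packet() derives from its arguments can be sketched as below. This is only a sketch under stated assumptions: the helper names ring_block_count() and ring_map_bytes() are hypothetical and do not appear in the patch, and the factor of two reflects the single mmap() call that covers the Rx ring followed by the Tx ring.

```c
/*
 * Hypothetical helper (not part of the patch) mirroring the parameter
 * checks in rte_eth_from_packet(). Returns the block count, or 0 when
 * the parameters would be rejected.
 */
static unsigned int
ring_block_count(unsigned int blocksize, unsigned int framesize,
		 unsigned int framecount)
{
	if (blocksize == 0 || framesize == 0 || framecount == 0)
		return 0;
	if (framesize > blocksize)
		return 0;
	/* frames per block is blocksize / framesize (integer division) */
	return framecount / (blocksize / framesize);
}

/*
 * Hypothetical helper: bytes mapped per queue pair, assuming the Rx
 * ring is immediately followed by an equally sized Tx ring.
 */
static unsigned long
ring_map_bytes(unsigned int blocksize, unsigned int blockcount)
{
	return 2UL * blocksize * blockcount;
}
```

With the documented defaults (blocksz 4096, framesz 2048, framecnt 512) this gives 256 blocks and a 2 MiB mapping per queue pair.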
  
Zhou, Danny Sept. 15, 2014, 3:43 p.m. UTC | #15
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Monday, September 15, 2014 11:10 PM
> To: Zhou, Danny
> Cc: John W. Linville; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> 
> On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > -----Original Message-----
> > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > Sent: Saturday, September 13, 2014 2:54 AM
> > > To: Zhou, Danny
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > >
> > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > I am concerned about its performance caused by too many
> > > > memcpy(). Specifically, on Rx side, kernel NIC driver needs to copy
> > > > packets to skb, then af_packet copies packets to AF_PACKET buffer
> > > > which are mapped to user space, and then those packets to be copied
> > > > to DPDK mbuf. In addition, 3 copies needed on Tx side. So to run a
> > > > simple DPDK L2/L3 forwarding benchmark, each packet needs 6 packet
> > > > copies which brings significant negative performance impact. We
> > > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > > code changes in kernel, John R will be presenting it in coming Linux
> > > > Plumbers Conference. Once kernel adopts it, the relevant PMD will be
> > > > submitted to dpdk.org.
> > >
> > > Admittedly, this is not as good a performer as most of the existing
> > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > > previously indicate that it performed better than the pcap-based PMD.
> >
> > Yes, slightly higher but makes no big difference.
> >
> Do you have numbers for this?  It seems to me faster is faster as long as it's
> statistically significant.  Even if it's not, John's AF_PACKET PMD has the ability
> to scale to multiple CPUs more easily than the pcap PMD, as it can make use of
> the AF_PACKET fanout feature.

For 64B small packets: 1.35M pps with 1 queue. As both the pcap and AF_PACKET PMDs depend on interrupt-based
kernel NIC drivers, none of the DPDK performance optimization techniques are utilized. Why should DPDK adopt
two similar, poorly performing PMDs that cannot demonstrate DPDK's key value, "high performance"?
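
For scale, the 1.35M pps figure can be set against the theoretical line rate for minimum-size frames. The sketch below is only illustrative: it assumes a 10 Gb/s link (the link speed behind the measurement is not stated here) and the usual 20 bytes of per-frame Ethernet wire overhead (preamble, start-of-frame delimiter and inter-frame gap). The helper name line_rate_pps is hypothetical and not part of DPDK.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire: 7B preamble + 1B SFD + 12B inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/*
 * Hypothetical helper (not a DPDK API): theoretical maximum number of frames
 * per second on a link of link_bps bits/s, for frames of frame_bytes bytes
 * (FCS included). The result is rounded down.
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	uint64_t bits_per_frame;

	assert(frame_bytes > 0);
	bits_per_frame = ((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8;
	return link_bps / bits_per_frame;
}
```

Under those assumptions, line_rate_pps(10000000000ULL, 64) comes to roughly 14.88M pps, so 1.35M pps would be on the order of 9% of 10GbE line rate for 64B frames.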

> 
> > > I look forward to seeing the changes you mention -- they sound very
> > > exciting.  But, they will still require both networking core and
> > > driver changes in the kernel.  And as I understand things today,
> > > the userland code will still need at least some knowledge of specific
> > > devices and how they layout their packet descriptors, etc.  So while
> > > those changes sound very promising, they will still have certain
> > > drawbacks in common with the current situation.
> >
> > Yes, we would like DPDK performance optimization techniques such as huge pages, efficient rx/tx routines
> > that manipulate device-specific packet descriptors, and the polling model to remain usable. We have to
> > trade off performance against commonality. But we believe it will be much easier to develop a DPDK PMD
> > for non-Intel NICs than to port entire kernel drivers to DPDK.
> >
> 
> Not sure how this relates, what you're describing is the feature Intel has been
> working on to augment kernel drivers to provide better throughput via direct
> hardware access to user space.  John's PMD provides ubiquitous function on all
> hardware. I'm not sure how the desire for one implies the other isn't valuable?
> 

Performance, rather than commonality, is the key value of DPDK. But we are trying to improve the commonality of our
solution so that it can be easily adopted by other NIC vendors.

> > > It seems like the changes you mention will still need some sort of
> > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > separate from the code I already posted?  Or did you add that work
> > > on top of mine?
> > >
> >
> > For userland code, it certainly uses some of your code related to raw sockets, but highly modified.
> > A layer will be added into the eth_dev library to do device probe and support new socket options.
> >
> 
> Ok, but again, PMDs are independent, and serve different needs.  If their use
> is at all overlapping from a functional standpoint, take this one now, and
> deprecate it when a better one comes along.  Though from your description it
> seems like both have a valid place in the ecosystem.
> 

I am OK with this approach, as long as this AF_PACKET PMD does not add extra maintenance effort. Thomas might make the call.

> Neil
> 
> > > John
> > >
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of John W. Linville
> > > > > Sent: Saturday, September 13, 2014 2:05 AM
> > > > > To: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > > >
> > > > > Ping?  Are there objections to this patch from mid-July?
> > > > >
> > > > > John
> > > > >
> > > > > On Mon, Jul 14, 2014 at 02:24:50PM -0400, John W. Linville wrote:
> > > > > > This is a Linux-specific virtual PMD driver backed by an AF_PACKET
> > > > > > socket.  This implementation uses mmap'ed ring buffers to limit copying
> > > > > > and user/kernel transitions.  The PACKET_FANOUT_HASH behavior of
> > > > > > AF_PACKET is used for frame reception.  In the current implementation,
> > > > > > Tx and Rx queues are always paired, and therefore are always equal
> > > > > > in number -- changing this would be a Simple Matter Of Programming.
> > > > > >
> > > > > > Interfaces of this type are created with a command line option like
> > > > > > > "--vdev=eth_packet0,iface=...".  There are a number of options available
> > > > > > as arguments:
> > > > > >
> > > > > >  - Interface is chosen by "iface" (required)
> > > > > >  - Number of queue pairs set by "qpairs" (optional, default: 1)
> > > > > >  - AF_PACKET MMAP block size set by "blocksz" (optional, default: 4096)
> > > > > >  - AF_PACKET MMAP frame size set by "framesz" (optional, default: 2048)
> > > > > >  - AF_PACKET MMAP frame count set by "framecnt" (optional, default: 512)
> > > > > >
> > > > > > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > > > > > ---
> > > > > > This PMD is intended to provide a means for using DPDK on a broad
> > > > > > range of hardware without hardware-specific PMDs and (hopefully)
> > > > > > with better performance than what PCAP offers in Linux.  This might
> > > > > > be useful as a development platform for DPDK applications when
> > > > > > DPDK-supported hardware is expensive or unavailable.
> > > > > >
> > > > > > New in v2:
> > > > > >
> > > > > > -- fixup some style issues found by check patch
> > > > > > -- use if_index as part of fanout group ID
> > > > > > -- set default number of queue pairs to 1
> > > > > >
> > > > > >  config/common_bsdapp                   |   5 +
> > > > > >  config/common_linuxapp                 |   5 +
> > > > > >  lib/Makefile                           |   1 +
> > > > > >  lib/librte_eal/linuxapp/eal/Makefile   |   1 +
> > > > > >  lib/librte_pmd_packet/Makefile         |  60 +++
> > > > > >  lib/librte_pmd_packet/rte_eth_packet.c | 826 +++++++++++++++++++++++++++++++++
> > > > > >  lib/librte_pmd_packet/rte_eth_packet.h |  55 +++
> > > > > >  mk/rte.app.mk                          |   4 +
> > > > > >  8 files changed, 957 insertions(+)
> > > > > >  create mode 100644 lib/librte_pmd_packet/Makefile
> > > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.c
> > > > > >  create mode 100644 lib/librte_pmd_packet/rte_eth_packet.h
> > > > > >
> > > > > > diff --git a/config/common_bsdapp b/config/common_bsdapp
> > > > > > index 943dce8f1ede..c317f031278e 100644
> > > > > > --- a/config/common_bsdapp
> > > > > > +++ b/config/common_bsdapp
> > > > > > @@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
> > > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > > >
> > > > > >  #
> > > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > > +#
> > > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=n
> > > > > > +
> > > > > > +#
> > > > > >  # Do prefetch of packet data within PMD driver receive function
> > > > > >  #
> > > > > >  CONFIG_RTE_PMD_PACKET_PREFETCH=y
> > > > > > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > > > > > index 7bf5d80d4e26..f9e7bc3015ec 100644
> > > > > > --- a/config/common_linuxapp
> > > > > > +++ b/config/common_linuxapp
> > > > > > @@ -249,6 +249,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=n
> > > > > >  CONFIG_RTE_LIBRTE_PMD_BOND=y
> > > > > >
> > > > > >  #
> > > > > > +# Compile software PMD backed by AF_PACKET sockets (Linux only)
> > > > > > +#
> > > > > > +CONFIG_RTE_LIBRTE_PMD_PACKET=y
> > > > > > +
> > > > > > +#
> > > > > >  # Compile Xen PMD
> > > > > >  #
> > > > > >  CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
> > > > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > > > index 10c5bb3045bc..930fadf29898 100644
> > > > > > --- a/lib/Makefile
> > > > > > +++ b/lib/Makefile
> > > > > > @@ -47,6 +47,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
> > > > > > +DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
> > > > > >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
> > > > > > diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > index 756d6b0c9301..feed24a63272 100644
> > > > > > --- a/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > +++ b/lib/librte_eal/linuxapp/eal/Makefile
> > > > > > @@ -44,6 +44,7 @@ CFLAGS += -I$(RTE_SDK)/lib/librte_ether
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
> > > > > > +CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
> > > > > >  CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
> > > > > >  CFLAGS += $(WERROR_FLAGS) -O3
> > > > > >
> > > > > > diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
> > > > > > new file mode 100644
> > > > > > index 000000000000..e1266fb992cd
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/Makefile
> > > > > > @@ -0,0 +1,60 @@
> > > > > > +#   BSD LICENSE
> > > > > > +#
> > > > > > +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> > > > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > +#   Copyright(c) 2014 6WIND S.A.
> > > > > > +#   All rights reserved.
> > > > > > +#
> > > > > > +#   Redistribution and use in source and binary forms, with or without
> > > > > > +#   modification, are permitted provided that the following conditions
> > > > > > +#   are met:
> > > > > > +#
> > > > > > +#     * Redistributions of source code must retain the above copyright
> > > > > > +#       notice, this list of conditions and the following disclaimer.
> > > > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > > > +#       notice, this list of conditions and the following disclaimer in
> > > > > > +#       the documentation and/or other materials provided with the
> > > > > > +#       distribution.
> > > > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > > > +#       contributors may be used to endorse or promote products derived
> > > > > > +#       from this software without specific prior written permission.
> > > > > > +#
> > > > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > +
> > > > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > > > +
> > > > > > +#
> > > > > > +# library name
> > > > > > +#
> > > > > > +LIB = librte_pmd_packet.a
> > > > > > +
> > > > > > +CFLAGS += -O3
> > > > > > +CFLAGS += $(WERROR_FLAGS)
> > > > > > +
> > > > > > +#
> > > > > > +# all source are stored in SRCS-y
> > > > > > +#
> > > > > > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
> > > > > > +
> > > > > > +#
> > > > > > +# Export include files
> > > > > > +#
> > > > > > +SYMLINK-y-include += rte_eth_packet.h
> > > > > > +
> > > > > > +# this lib depends upon:
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
> > > > > > +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
> > > > > > +
> > > > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..9c82d16e730f
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.c
> > > > > > @@ -0,0 +1,826 @@
> > > > > > +/*-
> > > > > > + *   BSD LICENSE
> > > > > > + *
> > > > > > + *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
> > > > > > + *
> > > > > > + *   Originally based upon librte_pmd_pcap code:
> > > > > > + *
> > > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > + *   Copyright(c) 2014 6WIND S.A.
> > > > > > + *   All rights reserved.
> > > > > > + *
> > > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > > + *   modification, are permitted provided that the following conditions
> > > > > > + *   are met:
> > > > > > + *
> > > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > > + *       the documentation and/or other materials provided with the
> > > > > > + *       distribution.
> > > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > > + *       contributors may be used to endorse or promote products derived
> > > > > > + *       from this software without specific prior written permission.
> > > > > > + *
> > > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > + */
> > > > > > +
> > > > > > +#include <rte_mbuf.h>
> > > > > > +#include <rte_ethdev.h>
> > > > > > +#include <rte_malloc.h>
> > > > > > +#include <rte_kvargs.h>
> > > > > > +#include <rte_dev.h>
> > > > > > +
> > > > > > +#include <linux/if_ether.h>
> > > > > > +#include <linux/if_packet.h>
> > > > > > +#include <arpa/inet.h>
> > > > > > +#include <net/if.h>
> > > > > > +#include <sys/types.h>
> > > > > > +#include <sys/socket.h>
> > > > > > +#include <sys/ioctl.h>
> > > > > > +#include <sys/mman.h>
> > > > > > +#include <unistd.h>
> > > > > > +#include <poll.h>
> > > > > > +
> > > > > > +#include "rte_eth_packet.h"
> > > > > > +
> > > > > > +#define ETH_PACKET_IFACE_ARG		"iface"
> > > > > > +#define ETH_PACKET_NUM_Q_ARG		"qpairs"
> > > > > > +#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
> > > > > > +#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
> > > > > > +#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
> > > > > > +
> > > > > > +#define DFLT_BLOCK_SIZE		(1 << 12)
> > > > > > +#define DFLT_FRAME_SIZE		(1 << 11)
> > > > > > +#define DFLT_FRAME_COUNT	(1 << 9)
> > > > > > +
> > > > > > +struct pkt_rx_queue {
> > > > > > +	int sockfd;
> > > > > > +
> > > > > > +	struct iovec *rd;
> > > > > > +	uint8_t *map;
> > > > > > +	unsigned int framecount;
> > > > > > +	unsigned int framenum;
> > > > > > +
> > > > > > +	struct rte_mempool *mb_pool;
> > > > > > +
> > > > > > +	volatile unsigned long rx_pkts;
> > > > > > +	volatile unsigned long err_pkts;
> > > > > > +};
> > > > > > +
> > > > > > +struct pkt_tx_queue {
> > > > > > +	int sockfd;
> > > > > > +
> > > > > > +	struct iovec *rd;
> > > > > > +	uint8_t *map;
> > > > > > +	unsigned int framecount;
> > > > > > +	unsigned int framenum;
> > > > > > +
> > > > > > +	volatile unsigned long tx_pkts;
> > > > > > +	volatile unsigned long err_pkts;
> > > > > > +};
> > > > > > +
> > > > > > +struct pmd_internals {
> > > > > > +	unsigned nb_queues;
> > > > > > +
> > > > > > +	int if_index;
> > > > > > +	struct ether_addr eth_addr;
> > > > > > +
> > > > > > +	struct tpacket_req req;
> > > > > > +
> > > > > > +	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > > +	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
> > > > > > +};
> > > > > > +
> > > > > > +static const char *valid_arguments[] = {
> > > > > > +	ETH_PACKET_IFACE_ARG,
> > > > > > +	ETH_PACKET_NUM_Q_ARG,
> > > > > > +	ETH_PACKET_BLOCKSIZE_ARG,
> > > > > > +	ETH_PACKET_FRAMESIZE_ARG,
> > > > > > +	ETH_PACKET_FRAMECOUNT_ARG,
> > > > > > +	NULL
> > > > > > +};
> > > > > > +
> > > > > > +static const char *drivername = "AF_PACKET PMD";
> > > > > > +
> > > > > > +static struct rte_eth_link pmd_link = {
> > > > > > +	.link_speed = 10000,
> > > > > > +	.link_duplex = ETH_LINK_FULL_DUPLEX,
> > > > > > +	.link_status = 0
> > > > > > +};
> > > > > > +
> > > > > > +static uint16_t
> > > > > > +eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	struct tpacket2_hdr *ppd;
> > > > > > +	struct rte_mbuf *mbuf;
> > > > > > +	uint8_t *pbuf;
> > > > > > +	struct pkt_rx_queue *pkt_q = queue;
> > > > > > +	uint16_t num_rx = 0;
> > > > > > +	unsigned int framecount, framenum;
> > > > > > +
> > > > > > +	if (unlikely(nb_pkts == 0))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Reads the given number of packets from the AF_PACKET socket one by
> > > > > > +	 * one and copies the packet data into a newly allocated mbuf.
> > > > > > +	 */
> > > > > > +	framecount = pkt_q->framecount;
> > > > > > +	framenum = pkt_q->framenum;
> > > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > > +		/* point at the next incoming frame */
> > > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +		if ((ppd->tp_status & TP_STATUS_USER) == 0)
> > > > > > +			break;
> > > > > > +
> > > > > > +		/* allocate the next mbuf */
> > > > > > +		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
> > > > > > +		if (unlikely(mbuf == NULL))
> > > > > > +			break;
> > > > > > +
> > > > > > +		/* packet will fit in the mbuf, go ahead and receive it */
> > > > > > +		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
> > > > > > +		pbuf = (uint8_t *) ppd + ppd->tp_mac;
> > > > > > +		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
> > > > > > +
> > > > > > +		/* release incoming frame and advance ring buffer */
> > > > > > +		ppd->tp_status = TP_STATUS_KERNEL;
> > > > > > +		if (++framenum >= framecount)
> > > > > > +			framenum = 0;
> > > > > > +
> > > > > > +		/* account for the receive frame */
> > > > > > +		bufs[i] = mbuf;
> > > > > > +		num_rx++;
> > > > > > +	}
> > > > > > +	pkt_q->framenum = framenum;
> > > > > > +	pkt_q->rx_pkts += num_rx;
> > > > > > +	return num_rx;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Callback to handle sending packets through a real NIC.
> > > > > > + */
> > > > > > +static uint16_t
> > > > > > +eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > > > > > +{
> > > > > > +	struct tpacket2_hdr *ppd;
> > > > > > +	struct rte_mbuf *mbuf;
> > > > > > +	uint8_t *pbuf;
> > > > > > +	unsigned int framecount, framenum;
> > > > > > +	struct pollfd pfd;
> > > > > > +	struct pkt_tx_queue *pkt_q = queue;
> > > > > > +	uint16_t num_tx = 0;
> > > > > > +	int i;
> > > > > > +
> > > > > > +	if (unlikely(nb_pkts == 0))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	memset(&pfd, 0, sizeof(pfd));
> > > > > > +	pfd.fd = pkt_q->sockfd;
> > > > > > +	pfd.events = POLLOUT;
> > > > > > +	pfd.revents = 0;
> > > > > > +
> > > > > > +	framecount = pkt_q->framecount;
> > > > > > +	framenum = pkt_q->framenum;
> > > > > > +	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > > > +		/* point at the next incoming frame */
> > > > > > +		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > > > > > +		    (poll(&pfd, 1, -1) < 0))
> > > > > > +				continue;
> > > > > > +
> > > > > > +		/* copy the tx frame data */
> > > > > > +		mbuf = bufs[num_tx];
> > > > > > +		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> > > > > > +			sizeof(struct sockaddr_ll);
> > > > > > +		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
> > > > > > +		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
> > > > > > +
> > > > > > +		/* release incoming frame and advance ring buffer */
> > > > > > +		ppd->tp_status = TP_STATUS_SEND_REQUEST;
> > > > > > +		if (++framenum >= framecount)
> > > > > > +			framenum = 0;
> > > > > > +		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
> > > > > > +
> > > > > > +		num_tx++;
> > > > > > +		rte_pktmbuf_free(mbuf);
> > > > > > +	}
> > > > > > +
> > > > > > +	/* kick-off transmits */
> > > > > > +	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > > > > +
> > > > > > +	pkt_q->framenum = framenum;
> > > > > > +	pkt_q->tx_pkts += num_tx;
> > > > > > +	pkt_q->err_pkts += nb_pkts - num_tx;
> > > > > > +	return num_tx;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_dev_start(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	dev->data->dev_link.link_status = 1;
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * This function gets called when the current port gets stopped.
> > > > > > + */
> > > > > > +static void
> > > > > > +eth_dev_stop(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	int sockfd;
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	for (i = 0; i < internals->nb_queues; i++) {
> > > > > > +		sockfd = internals->rx_queue[i].sockfd;
> > > > > > +		if (sockfd != -1)
> > > > > > +			close(sockfd);
> > > > > > +		sockfd = internals->tx_queue[i].sockfd;
> > > > > > +		if (sockfd != -1)
> > > > > > +			close(sockfd);
> > > > > > +	}
> > > > > > +
> > > > > > +	dev->data->dev_link.link_status = 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> > > > > > +{
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	dev_info->driver_name = drivername;
> > > > > > +	dev_info->if_index = internals->if_index;
> > > > > > +	dev_info->max_mac_addrs = 1;
> > > > > > +	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
> > > > > > +	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
> > > > > > +	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
> > > > > > +	dev_info->min_rx_bufsize = 0;
> > > > > > +	dev_info->pci_dev = NULL;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> > > > > > +{
> > > > > > +	unsigned i, imax;
> > > > > > +	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> > > > > > +	const struct pmd_internals *internal = dev->data->dev_private;
> > > > > > +
> > > > > > +	memset(igb_stats, 0, sizeof(*igb_stats));
> > > > > > +
> > > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > > +	for (i = 0; i < imax; i++) {
> > > > > > +		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> > > > > > +		rx_total += igb_stats->q_ipackets[i];
> > > > > > +	}
> > > > > > +
> > > > > > +	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
> > > > > > +	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
> > > > > > +	for (i = 0; i < imax; i++) {
> > > > > > +		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> > > > > > +		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
> > > > > > +		tx_total += igb_stats->q_opackets[i];
> > > > > > +		tx_err_total += igb_stats->q_errors[i];
> > > > > > +	}
> > > > > > +
> > > > > > +	igb_stats->ipackets = rx_total;
> > > > > > +	igb_stats->opackets = tx_total;
> > > > > > +	igb_stats->oerrors = tx_err_total;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_stats_reset(struct rte_eth_dev *dev)
> > > > > > +{
> > > > > > +	unsigned i;
> > > > > > +	struct pmd_internals *internal = dev->data->dev_private;
> > > > > > +
> > > > > > +	for (i = 0; i < internal->nb_queues; i++)
> > > > > > +		internal->rx_queue[i].rx_pkts = 0;
> > > > > > +
> > > > > > +	for (i = 0; i < internal->nb_queues; i++) {
> > > > > > +		internal->tx_queue[i].tx_pkts = 0;
> > > > > > +		internal->tx_queue[i].err_pkts = 0;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_dev_close(struct rte_eth_dev *dev __rte_unused)
> > > > > > +{
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +eth_queue_release(void *q __rte_unused)
> > > > > > +{
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_link_update(struct rte_eth_dev *dev __rte_unused,
> > > > > > +                int wait_to_complete __rte_unused)
> > > > > > +{
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_rx_queue_setup(struct rte_eth_dev *dev,
> > > > > > +                   uint16_t rx_queue_id,
> > > > > > +                   uint16_t nb_rx_desc __rte_unused,
> > > > > > +                   unsigned int socket_id __rte_unused,
> > > > > > +                   const struct rte_eth_rxconf *rx_conf __rte_unused,
> > > > > > +                   struct rte_mempool *mb_pool)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
> > > > > > +	struct rte_pktmbuf_pool_private *mbp_priv;
> > > > > > +	uint16_t buf_size;
> > > > > > +
> > > > > > +	pkt_q->mb_pool = mb_pool;
> > > > > > +
> > > > > > +	/* Now get the space available for data in the mbuf */
> > > > > > +	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
> > > > > > +	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
> > > > > > +	                       RTE_PKTMBUF_HEADROOM);
> > > > > > +
> > > > > > +	if (ETH_FRAME_LEN > buf_size) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
> > > > > > +			dev->data->name, ETH_FRAME_LEN, buf_size);
> > > > > > +		return -ENOMEM;
> > > > > > +	}
> > > > > > +
> > > > > > +	dev->data->rx_queues[rx_queue_id] = pkt_q;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +eth_tx_queue_setup(struct rte_eth_dev *dev,
> > > > > > +                   uint16_t tx_queue_id,
> > > > > > +                   uint16_t nb_tx_desc __rte_unused,
> > > > > > +                   unsigned int socket_id __rte_unused,
> > > > > > +                   const struct rte_eth_txconf *tx_conf __rte_unused)
> > > > > > +{
> > > > > > +
> > > > > > +	struct pmd_internals *internals = dev->data->dev_private;
> > > > > > +
> > > > > > +	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static struct eth_dev_ops ops = {
> > > > > > +	.dev_start = eth_dev_start,
> > > > > > +	.dev_stop = eth_dev_stop,
> > > > > > +	.dev_close = eth_dev_close,
> > > > > > +	.dev_configure = eth_dev_configure,
> > > > > > +	.dev_infos_get = eth_dev_info,
> > > > > > +	.rx_queue_setup = eth_rx_queue_setup,
> > > > > > +	.tx_queue_setup = eth_tx_queue_setup,
> > > > > > +	.rx_queue_release = eth_queue_release,
> > > > > > +	.tx_queue_release = eth_queue_release,
> > > > > > +	.link_update = eth_link_update,
> > > > > > +	.stats_get = eth_stats_get,
> > > > > > +	.stats_reset = eth_stats_reset,
> > > > > > +};
> > > > > > +
> > > > > > +/*
> > > > > > + * Opens an AF_PACKET socket
> > > > > > + */
> > > > > > +static int
> > > > > > +open_packet_iface(const char *key __rte_unused,
> > > > > > +                  const char *value __rte_unused,
> > > > > > +                  void *extra_args)
> > > > > > +{
> > > > > > +	int *sockfd = extra_args;
> > > > > > +
> > > > > > +	/* Open an AF_PACKET socket... */
> > > > > > +	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > > +	if (*sockfd == -1) {
> > > > > > +		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +rte_pmd_init_internals(const char *name,
> > > > > > +                       const int sockfd,
> > > > > > +                       const unsigned nb_queues,
> > > > > > +                       unsigned int blocksize,
> > > > > > +                       unsigned int blockcnt,
> > > > > > +                       unsigned int framesize,
> > > > > > +                       unsigned int framecnt,
> > > > > > +                       const unsigned numa_node,
> > > > > > +                       struct pmd_internals **internals,
> > > > > > +                       struct rte_eth_dev **eth_dev,
> > > > > > +                       struct rte_kvargs *kvlist)
> > > > > > +{
> > > > > > +	struct rte_eth_dev_data *data = NULL;
> > > > > > +	struct rte_pci_device *pci_dev = NULL;
> > > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > > +	struct ifreq ifr;
> > > > > > +	size_t ifnamelen;
> > > > > > +	unsigned k_idx;
> > > > > > +	struct sockaddr_ll sockaddr;
> > > > > > +	struct tpacket_req *req;
> > > > > > +	struct pkt_rx_queue *rx_queue;
> > > > > > +	struct pkt_tx_queue *tx_queue;
> > > > > > +	int rc, tpver, discard, bypass;
> > > > > > +	unsigned int i, q, rdsize;
> > > > > > +	int qsockfd, fanout_arg;
> > > > > > +
> > > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > > +		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
> > > > > > +			break;
> > > > > > +	}
> > > > > > +	if (pair == NULL) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: no interface specified for AF_PACKET ethdev\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD,
> > > > > > +		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
> > > > > > +		name, numa_node);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * now do all data allocation - for eth_dev structure, dummy pci driver
> > > > > > +	 * and internal (private) data
> > > > > > +	 */
> > > > > > +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> > > > > > +	if (data == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
> > > > > > +	if (pci_dev == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	*internals = rte_zmalloc_socket(name, sizeof(**internals),
> > > > > > +	                                0, numa_node);
> > > > > > +	if (*internals == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	req = &((*internals)->req);
> > > > > > +
> > > > > > +	req->tp_block_size = blocksize;
> > > > > > +	req->tp_block_nr = blockcnt;
> > > > > > +	req->tp_frame_size = framesize;
> > > > > > +	req->tp_frame_nr = framecnt;
> > > > > > +
> > > > > > +	ifnamelen = strlen(pair->value);
> > > > > > +	if (ifnamelen < sizeof(ifr.ifr_name)) {
> > > > > > +		memcpy(ifr.ifr_name, pair->value, ifnamelen);
> > > > > > +		ifr.ifr_name[ifnamelen] = '\0';
> > > > > > +	} else {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: I/F name too long (%s)\n",
> > > > > > +			name, pair->value);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: ioctl failed (SIOCGIFINDEX)\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	(*internals)->if_index = ifr.ifr_ifindex;
> > > > > > +
> > > > > > +	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: ioctl failed (SIOCGIFHWADDR)\n",
> > > > > > +		        name);
> > > > > > +		goto error;
> > > > > > +	}
> > > > > > +	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
> > > > > > +
> > > > > > +	memset(&sockaddr, 0, sizeof(sockaddr));
> > > > > > +	sockaddr.sll_family = AF_PACKET;
> > > > > > +	sockaddr.sll_protocol = htons(ETH_P_ALL);
> > > > > > +	sockaddr.sll_ifindex = (*internals)->if_index;
> > > > > > +
> > > > > > +	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
> > > > > > +	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
> > > > > > +	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
> > > > > > +
> > > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > > +		/* Open an AF_PACKET socket for this queue... */
> > > > > > +		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > > > > +		if (qsockfd == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +			        "%s: could not open AF_PACKET socket\n",
> > > > > > +			        name);
> > > > > > +			return -1;
> > > > > > +		}
> > > > > > +
> > > > > > +		tpver = TPACKET_V2;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
> > > > > > +				&tpver, sizeof(tpver));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_VERSION on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		discard = 1;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
> > > > > > +				&discard, sizeof(discard));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_LOSS on "
> > > > > > +			        "AF_PACKET socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		bypass = 1;
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
> > > > > > +				&bypass, sizeof(bypass));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_QDISC_BYPASS "
> > > > > > +			        "on AF_PACKET socket for %s\n", name,
> > > > > > +			        pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_RX_RING on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_TX_RING on AF_PACKET "
> > > > > > +				"socket for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rx_queue = &((*internals)->rx_queue[q]);
> > > > > > +		rx_queue->framecount = req->tp_frame_nr;
> > > > > > +
> > > > > > +		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
> > > > > > +				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
> > > > > > +				    qsockfd, 0);
> > > > > > +		if (rx_queue->map == MAP_FAILED) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: call to mmap failed on AF_PACKET socket for %s\n",
> > > > > > +				name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		/* rdsize is same for both Tx and Rx */
> > > > > > +		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
> > > > > > +
> > > > > > +		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > > +			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
> > > > > > +			rx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > > +		}
> > > > > > +		rx_queue->sockfd = qsockfd;
> > > > > > +
> > > > > > +		tx_queue = &((*internals)->tx_queue[q]);
> > > > > > +		tx_queue->framecount = req->tp_frame_nr;
> > > > > > +
> > > > > > +		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
> > > > > > +
> > > > > > +		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
> > > > > > +		for (i = 0; i < req->tp_frame_nr; ++i) {
> > > > > > +			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
> > > > > > +			tx_queue->rd[i].iov_len = req->tp_frame_size;
> > > > > > +		}
> > > > > > +		tx_queue->sockfd = qsockfd;
> > > > > > +
> > > > > > +		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not bind AF_PACKET socket to %s\n",
> > > > > > +			        name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +
> > > > > > +		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
> > > > > > +				&fanout_arg, sizeof(fanout_arg));
> > > > > > +		if (rc == -1) {
> > > > > > +			RTE_LOG(ERR, PMD,
> > > > > > +				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
> > > > > > +				"for %s\n", name, pair->value);
> > > > > > +			goto error;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	/* reserve an ethdev entry */
> > > > > > +	*eth_dev = rte_eth_dev_allocate(name);
> > > > > > +	if (*eth_dev == NULL)
> > > > > > +		goto error;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * now put it all together
> > > > > > +	 * - store queue data in internals,
> > > > > > +	 * - store numa_node info in pci_driver
> > > > > > +	 * - point eth_dev_data to internals and pci_driver
> > > > > > +	 * - and point eth_dev structure to new eth_dev_data structure
> > > > > > +	 */
> > > > > > +
> > > > > > +	(*internals)->nb_queues = nb_queues;
> > > > > > +
> > > > > > +	data->dev_private = *internals;
> > > > > > +	data->port_id = (*eth_dev)->data->port_id;
> > > > > > +	data->nb_rx_queues = (uint16_t)nb_queues;
> > > > > > +	data->nb_tx_queues = (uint16_t)nb_queues;
> > > > > > +	data->dev_link = pmd_link;
> > > > > > +	data->mac_addrs = &(*internals)->eth_addr;
> > > > > > +
> > > > > > +	pci_dev->numa_node = numa_node;
> > > > > > +
> > > > > > +	(*eth_dev)->data = data;
> > > > > > +	(*eth_dev)->dev_ops = &ops;
> > > > > > +	(*eth_dev)->pci_dev = pci_dev;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +error:
> > > > > > +	if (data)
> > > > > > +		rte_free(data);
> > > > > > +	if (pci_dev)
> > > > > > +		rte_free(pci_dev);
> > > > > > +	for (q = 0; q < nb_queues; q++) {
> > > > > > +		if ((*internals)->rx_queue[q].rd)
> > > > > > +			rte_free((*internals)->rx_queue[q].rd);
> > > > > > +		if ((*internals)->tx_queue[q].rd)
> > > > > > +			rte_free((*internals)->tx_queue[q].rd);
> > > > > > +	}
> > > > > > +	if (*internals)
> > > > > > +		rte_free(*internals);
> > > > > > +	return -1;
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +rte_eth_from_packet(const char *name,
> > > > > > +                    int const *sockfd,
> > > > > > +                    const unsigned numa_node,
> > > > > > +                    struct rte_kvargs *kvlist)
> > > > > > +{
> > > > > > +	struct pmd_internals *internals = NULL;
> > > > > > +	struct rte_eth_dev *eth_dev = NULL;
> > > > > > +	struct rte_kvargs_pair *pair = NULL;
> > > > > > +	unsigned k_idx;
> > > > > > +	unsigned int blockcount;
> > > > > > +	unsigned int blocksize = DFLT_BLOCK_SIZE;
> > > > > > +	unsigned int framesize = DFLT_FRAME_SIZE;
> > > > > > +	unsigned int framecount = DFLT_FRAME_COUNT;
> > > > > > +	unsigned int qpairs = 1;
> > > > > > +
> > > > > > +	/* do some parameter checking */
> > > > > > +	if (*sockfd < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Walk arguments for configurable settings
> > > > > > +	 */
> > > > > > +	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
> > > > > > +		pair = &kvlist->pairs[k_idx];
> > > > > > +		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
> > > > > > +			qpairs = atoi(pair->value);
> > > > > > +			if (qpairs < 1 ||
> > > > > > +			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid qpairs value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
> > > > > > +			blocksize = atoi(pair->value);
> > > > > > +			if (!blocksize) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid blocksize value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
> > > > > > +			framesize = atoi(pair->value);
> > > > > > +			if (!framesize) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid framesize value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
> > > > > > +			framecount = atoi(pair->value);
> > > > > > +			if (!framecount) {
> > > > > > +				RTE_LOG(ERR, PMD,
> > > > > > +					"%s: invalid framecount value\n",
> > > > > > +				        name);
> > > > > > +				return -1;
> > > > > > +			}
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	if (framesize > blocksize) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
> > > > > > +		        name);
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	blockcount = framecount / (blocksize / framesize);
> > > > > > +	if (!blockcount) {
> > > > > > +		RTE_LOG(ERR, PMD,
> > > > > > +			"%s: invalid AF_PACKET MMAP parameters\n", name);
> > > > > > +		return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock size %d\n", name, blocksize);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tblock count %d\n", name, blockcount);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe size %d\n", name, framesize);
> > > > > > +	RTE_LOG(INFO, PMD, "%s:\tframe count %d\n", name, framecount);
> > > > > > +
> > > > > > +	if (rte_pmd_init_internals(name, *sockfd, qpairs,
> > > > > > +	                           blocksize, blockcount,
> > > > > > +	                           framesize, framecount,
> > > > > > +	                           numa_node, &internals, &eth_dev,
> > > > > > +	                           kvlist) < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	eth_dev->rx_pkt_burst = eth_packet_rx;
> > > > > > +	eth_dev->tx_pkt_burst = eth_packet_tx;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +int
> > > > > > +rte_pmd_packet_devinit(const char *name, const char *params)
> > > > > > +{
> > > > > > +	unsigned numa_node;
> > > > > > +	int ret;
> > > > > > +	struct rte_kvargs *kvlist;
> > > > > > +	int sockfd = -1;
> > > > > > +
> > > > > > +	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
> > > > > > +
> > > > > > +	numa_node = rte_socket_id();
> > > > > > +
> > > > > > +	kvlist = rte_kvargs_parse(params, valid_arguments);
> > > > > > +	if (kvlist == NULL)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * If iface argument is passed we open the NICs and use them for
> > > > > > +	 * reading / writing
> > > > > > +	 */
> > > > > > +	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
> > > > > > +
> > > > > > +		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
> > > > > > +		                         &open_packet_iface, &sockfd);
> > > > > > +		if (ret < 0)
> > > > > > +			return -1;
> > > > > > +	}
> > > > > > +
> > > > > > +	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
> > > > > > +	close(sockfd); /* no longer needed */
> > > > > > +
> > > > > > +	if (ret < 0)
> > > > > > +		return -1;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static struct rte_driver pmd_packet_drv = {
> > > > > > +	.name = "eth_packet",
> > > > > > +	.type = PMD_VDEV,
> > > > > > +	.init = rte_pmd_packet_devinit,
> > > > > > +};
> > > > > > +
> > > > > > +PMD_REGISTER_DRIVER(pmd_packet_drv);
> > > > > > diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..f685611da3e9
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_pmd_packet/rte_eth_packet.h
> > > > > > @@ -0,0 +1,55 @@
> > > > > > +/*-
> > > > > > + *   BSD LICENSE
> > > > > > + *
> > > > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > > > + *   All rights reserved.
> > > > > > + *
> > > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > > + *   modification, are permitted provided that the following conditions
> > > > > > + *   are met:
> > > > > > + *
> > > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > > + *       the documentation and/or other materials provided with the
> > > > > > + *       distribution.
> > > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > > + *       contributors may be used to endorse or promote products derived
> > > > > > + *       from this software without specific prior written permission.
> > > > > > + *
> > > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef _RTE_ETH_PACKET_H_
> > > > > > +#define _RTE_ETH_PACKET_H_
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +extern "C" {
> > > > > > +#endif
> > > > > > +
> > > > > > +#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
> > > > > > +
> > > > > > +#define RTE_PMD_PACKET_MAX_RINGS 16
> > > > > > +
> > > > > > +/**
> > > > > > + * For use by the EAL only. Called as part of EAL init to set up any dummy NICs
> > > > > > + * configured on command line.
> > > > > > + */
> > > > > > +int rte_pmd_packet_devinit(const char *name, const char *params);
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +}
> > > > > > +#endif
> > > > > > +
> > > > > > +#endif
> > > > > > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > > > > > index 34dff2a02a05..a6994c4dbe93 100644
> > > > > > --- a/mk/rte.app.mk
> > > > > > +++ b/mk/rte.app.mk
> > > > > > @@ -210,6 +210,10 @@ ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
> > > > > >  LDLIBS += -lrte_pmd_pcap -lpcap
> > > > > >  endif
> > > > > >
> > > > > > +ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
> > > > > > +LDLIBS += -lrte_pmd_packet
> > > > > > +endif
> > > > > > +
> > > > > >  endif # plugins
> > > > > >
> > > > > >  LDLIBS += $(EXECENV_LDLIBS)
> > > > > > --
> > > > > > 1.9.3
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > John W. Linville		Someday the world will need a hero, and you
> > > > > linville@tuxdriver.com			might be all we have.  Be ready.
> > > >
> > >
> > > --
> > > John W. Linville		Someday the world will need a hero, and you
> > > linville@tuxdriver.com			might be all we have.  Be ready.
> >
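
The ring sizing in the quoted rte_eth_from_packet() can be checked in
isolation.  What follows is a minimal sketch, not part of the patch:
struct ring_geometry and ring_geometry_compute() are hypothetical names,
and the final divisibility check is an addition.  The kernel's packet_mmap
documentation requires tp_frame_nr to equal frames-per-block times
tp_block_nr, which the integer division in the patch does not guarantee
for every framecnt value.

```c
#include <stddef.h>

/* Hypothetical helper type; field names are illustrative, not from the patch. */
struct ring_geometry {
	unsigned int frames_per_block;
	unsigned int block_count;
	size_t ring_bytes;	/* one ring (Rx or Tx) */
	size_t map_bytes;	/* Rx and Tx rings mapped back to back */
};

/*
 * Mirrors the patch's arithmetic:
 *   blockcount = framecount / (blocksize / framesize)
 * and the mmap length of 2 * tp_block_size * tp_block_nr.
 * Returns 0 on success, -1 if the parameters cannot describe a ring.
 */
static int
ring_geometry_compute(unsigned int blocksize, unsigned int framesize,
		      unsigned int framecount, struct ring_geometry *g)
{
	if (blocksize == 0 || framesize == 0 || framecount == 0)
		return -1;
	if (framesize > blocksize)
		return -1;

	g->frames_per_block = blocksize / framesize;
	g->block_count = framecount / g->frames_per_block;
	if (g->block_count == 0)
		return -1;

	/*
	 * Assumption, not in the patch: reject frame counts that the
	 * integer division above would silently round down.
	 */
	if (g->frames_per_block * g->block_count != framecount)
		return -1;

	g->ring_bytes = (size_t)blocksize * g->block_count;
	g->map_bytes = 2 * g->ring_bytes;
	return 0;
}
```

With the defaults from the commit message (blocksz 4096, framesz 2048,
framecnt 512) this gives 2 frames per block, 256 blocks, 1 MiB per ring
and a 2 MiB mapping per queue pair.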
  
Neil Horman Sept. 15, 2014, 4:22 p.m. UTC | #16
On Mon, Sep 15, 2014 at 03:43:07PM +0000, Zhou, Danny wrote:
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Monday, September 15, 2014 11:10 PM
> > To: Zhou, Danny
> > Cc: John W. Linville; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > 
> > On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > > -----Original Message-----
> > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > Sent: Saturday, September 13, 2014 2:54 AM
> > > > To: Zhou, Danny
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > >
> > > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > > > I am concerned about its performance, which suffers from too many
> > > > > > memcpy() calls. Specifically, on the Rx side, the kernel NIC driver copies
> > > > > > packets to an skb, then af_packet copies them to the AF_PACKET buffer
> > > > > > that is mapped to user space, and then they are copied again
> > > > > > to a DPDK mbuf. Another 3 copies are needed on the Tx side. So to run a
> > > > > > simple DPDK L2/L3 forwarding benchmark, each packet needs 6
> > > > > > copies, which has a significant negative performance impact. We
> > > > > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > > > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > > > > code changes in the kernel; John R will be presenting it at the coming Linux
> > > > > > Plumbers Conference. Once the kernel adopts it, the relevant PMD will be
> > > > > > submitted to dpdk.org.
> > > >
> > > > Admittedly, this is not as good a performer as most of the existing
> > > > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > > > previously indicate that it performed better than the pcap-based PMD.
> > >
> > > Yes, slightly higher but makes no big difference.
> > >
> > Do you have numbers for this?  It seems to me faster is faster as long as it's
> > statistically significant.  Even if it's not, John's AF_PACKET PMD has the ability
> > to scale to multiple CPUs more easily than the pcap PMD, as it can make use of
> > the AF_PACKET fanout feature.
> 
> For 64B small packet, 1.35M pps with 1 queue.
Why did you only test with a single queue?  Multiqueue operation was one of the
big advantages of the AF_PACKET-based PMD.  I would expect a single-queue setup
to perform in a very similar fashion to the pcap PMD.
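
For scale, the quoted 1.35 Mpps figure can be combined with the six copies
per forwarded packet claimed earlier in the thread.  The sketch below is
only arithmetic; the function names are hypothetical, and the 10 Gbit/s
link speed used for comparison is an assumption, since the thread does not
state the NIC speed.

```c
#include <stdint.h>

/* Bytes moved by packet copies per second at a given rate, size and copy count. */
static uint64_t
copy_bytes_per_sec(uint64_t pps, uint64_t pkt_bytes, uint64_t copies)
{
	return pps * pkt_bytes * copies;
}

/*
 * Maximum Ethernet frame rate for a link.  Each frame occupies 20 extra
 * bytes on the wire: 7 preamble + 1 start-of-frame delimiter + 12
 * inter-frame gap.
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint64_t frame_bytes)
{
	return link_bps / ((frame_bytes + 20) * 8);
}
```

copy_bytes_per_sec(1350000, 64, 6) comes to 518,400,000 bytes copied per
second, and line_rate_pps(10000000000, 64) to 14,880,952 frames per second,
so under that link-speed assumption one queue reaches roughly 9% of line
rate for 64-byte packets.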

> As both pcap and AF_PACKET PMDs depend on interrupt-based NIC kernel drivers, none of the DPDK performance optimization techniques are utilized. Why should DPDK adopt
> two similar and poorly performing PMDs which cannot demonstrate DPDK's key value, "high performance"?
Several reasons:
* "High performance" isn't always the key need for end users.  Consider the
pre-hardware-availability development phase.

* Better hardware modeling (consider AF_PACKET's multiqueue ability)

* Better scaling (pcap doesn't make use of the fanout features that AF_PACKET
does)

* Space savings: building the AF_PACKET PMD doesn't require the additional
building/storage of the pcap driver.
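
For reference, the fanout argument the patch passes to PACKET_FANOUT packs
a 16-bit group ID into the low half and the fanout type and flags into the
high half.  A minimal sketch of that packing follows; the SK_* names and
function names are hypothetical, the constant values mirror
<linux/if_packet.h>, and uint32_t is used instead of the patch's int so
that the shift does not reach the sign bit.

```c
#include <stdint.h>

/* Values mirror <linux/if_packet.h>; repeated so the sketch builds anywhere. */
#define SK_FANOUT_HASH		0x0000u	/* PACKET_FANOUT_HASH */
#define SK_FANOUT_FLAG_ROLLOVER	0x1000u	/* PACKET_FANOUT_FLAG_ROLLOVER */
#define SK_FANOUT_FLAG_DEFRAG	0x8000u	/* PACKET_FANOUT_FLAG_DEFRAG */

/* Group ID derived from pid and interface index, as in the quoted patch. */
static uint32_t
fanout_arg_build(uint32_t pid, uint32_t if_index)
{
	uint32_t arg = (pid ^ if_index) & 0xffffu;

	arg |= (SK_FANOUT_HASH | SK_FANOUT_FLAG_DEFRAG |
		SK_FANOUT_FLAG_ROLLOVER) << 16;
	return arg;
}

/* Low 16 bits: the group every queue socket joins. */
static uint32_t
fanout_group_id(uint32_t arg)
{
	return arg & 0xffffu;
}

/* High 16 bits: fanout type OR'ed with flags. */
static uint32_t
fanout_type_flags(uint32_t arg)
{
	return arg >> 16;
}
```

Every queue socket that sets the same value joins the same group, which is
what lets the kernel hash each flow to one of the queues.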


> 
> > 
> > > > I look forward to seeing the changes you mention -- they sound very
> > > > exciting.  But, they will still require both networking core and
> > > > driver changes in the kernel.  And as I understand things today,
> > > > the userland code will still need at least some knowledge of specific
> > > > devices and how they layout their packet descriptors, etc.  So while
> > > > those changes sound very promising, they will still have certain
> > > > drawbacks in common with the current situation.
> > >
> > > > Yes, we would like DPDK performance optimization techniques such as huge pages, efficient rx/tx routines that manipulate
> > > > device-specific packet descriptors, and the polling model to remain usable. We have to trade off performance against commonality. But we believe it will
> > > > be much easier to develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers to DPDK.
> > >
> > 
> > > Not sure how this relates; what you're describing is the feature Intel has been
> > > working on to augment kernel drivers to provide better throughput via direct
> > > hardware access to user space.  John's PMD provides ubiquitous function on all
> > > hardware. I'm not sure how the desire for one implies the other isn't valuable?
> > 
> 
> Performance is the key value of DPDK, instead of commonality. But we are trying to improve commonality of our solution to make it easily 
> adopted by other NIC vendors.
> 
That's completely irrelevant to the question at hand.  To go with your reasoning,
if performance is the key value of the DPDK, then you should remove all driver
support save for the most performant hardware you have.  By that same token,
you should deprecate the pcap driver in favor of this AF_PACKET driver, because
it has shown performance improvement.

I'm being facetious, of course, but the facts remain: Lack of superior
performance from one PMD to the next does not immediately obviate the need for
one PMD over another, as they quite likely address differing needs.  As you note,
the DPDK seeks performance as a key goal, but it's an open source project, and there
are other needs from other users in play here.  The AF_PACKET PMD provides
superior performance on Linux platforms when hardware independence is required.
It differs from the pcap PMD as it uses features that are only available on the
Linux platform, so it stands to reason we should have both.

> > > > It seems like the changes you mention will still need some sort of
> > > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > > separate from the code I already posted?  Or did you add that work
> > > > on top of mine?
> > > >
> > >
> > > > For userland code, it certainly uses some of your code related to raw sockets, but highly modified. A layer will be added to the eth_dev
> > > > library to do device probe and support new socket options.
> > >
> > 
> > > Ok, but again, PMDs are independent, and serve different needs.  If their use
> > > is at all overlapping from a functional standpoint, take this one now, and
> > deprecate it when a better one comes along.  Though from your description it
> > seems like both have a valid place in the ecosystem.
> > 
> 
> > I am ok with this approach, as long as this AF_PACKET PMD does not add extra maintenance effort. Thomas might make the call.
> 
What extra maintainer efforts do you think are required here, that wouldn't be
required for any PMD?  To suggest that a given PMD shouldn't be included because
it would require additional effort to maintain holds it to a higher standard
than the PMDs already included.  I don't recall anyone asking if the i40e or
bonding PMDs would require additional effort before being integrated.

Neil
  
John W. Linville Sept. 15, 2014, 5:48 p.m. UTC | #17
On Mon, Sep 15, 2014 at 12:22:44PM -0400, Neil Horman wrote:
> On Mon, Sep 15, 2014 at 03:43:07PM +0000, Zhou, Danny wrote:
> > 
> > > -----Original Message-----
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Monday, September 15, 2014 11:10 PM
> > > To: Zhou, Danny
> > > Cc: John W. Linville; dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > 
> > > On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > > > -----Original Message-----
> > > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > > Sent: Saturday, September 13, 2014 2:54 AM
> > > > > To: Zhou, Danny
> > > > > Cc: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > > >
> > > > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > > > > I am concerned about its performance, which suffers from too many
> > > > > > > memcpy() calls. Specifically, on the Rx side, the kernel NIC driver copies
> > > > > > > packets to an skb, then af_packet copies them to the AF_PACKET buffer
> > > > > > > that is mapped to user space, and then they are copied again
> > > > > > > to a DPDK mbuf. Another 3 copies are needed on the Tx side. So to run a
> > > > > > > simple DPDK L2/L3 forwarding benchmark, each packet needs 6
> > > > > > > copies, which has a significant negative performance impact. We
> > > > > > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > > > > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > > > > > code changes in the kernel; John R will be presenting it at the coming Linux
> > > > > > > Plumbers Conference. Once the kernel adopts it, the relevant PMD will be
> > > > > > > submitted to dpdk.org.
> > > > >
> > > > > Admittedly, this is not as good a performer as most of the existing
> > > > > > PMDs.  It serves a different purpose, after all.  FWIW, you did
> > > > > previously indicate that it performed better than the pcap-based PMD.
> > > >
> > > > Yes, slightly higher but makes no big difference.
> > > >
> > > > Do you have numbers for this?  It seems to me faster is faster as long as it's
> > > > statistically significant.  Even if it's not, John's AF_PACKET PMD has the ability
> > > > to scale to multiple CPUs more easily than the pcap PMD, as it can make use of
> > > > the AF_PACKET fanout feature.
> > 
> > For 64B small packet, 1.35M pps with 1 queue.
> Why did you only test with a single queue?  Multiqueue operation was one of the
> big advantages of the AF_PACKET-based PMD.  I would expect a single-queue setup
> to perform in a very similar fashion to the pcap PMD.
> 
> > As both pcap and AF_PACKET PMDs depend on interrupt-based NIC kernel drivers, none of the DPDK performance optimization techniques are utilized. Why should DPDK adopt
> > two similar and poorly performing PMDs which cannot demonstrate DPDK's key value, "high performance"?
> Several reasons:
> * "High performance" isn't always the key need for end users.  Consider the
> pre-hardware-availability development phase.
> 
> * Better hardware modeling (consider AF_PACKET's multiqueue ability)
> 
> * Better scaling (pcap doesn't make use of the fanout features that AF_PACKET
> does)
> 
> * Space savings: building the AF_PACKET PMD doesn't require the additional
> building/storage of the pcap driver.

This would include not requiring a dependency on libpcap, if nothing else.
 
> > 
> > > 
> > > > > I look forward to seeing the changes you mention -- they sound very
> > > > > exciting.  But, they will still require both networking core and
> > > > > driver changes in the kernel.  And as I understand things today,
> > > > > the userland code will still need at least some knowledge of specific
> > > > > devices and how they layout their packet descriptors, etc.  So while
> > > > > those changes sound very promising, they will still have certain
> > > > > drawbacks in common with the current situation.
> > > >
> > > > > Yes, we would like DPDK performance optimization techniques such as huge pages, efficient rx/tx routines that manipulate
> > > > > device-specific packet descriptors, and the polling model to remain usable. We have to trade off performance against commonality. But we believe it will
> > > > > be much easier to develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers to DPDK.
> > > >
> > > 
> > > > Not sure how this relates; what you're describing is the feature Intel has been
> > > > working on to augment kernel drivers to provide better throughput via direct
> > > > hardware access to user space.  John's PMD provides ubiquitous function on all
> > > > hardware. I'm not sure how the desire for one implies the other isn't valuable?
> > > 
> > 
> > Performance is the key value of DPDK, instead of commonality. But we are trying to improve commonality of our solution to make it easily 
> > adopted by other NIC vendors.
> > 
> That's completely irrelevant to the question at hand.  To go with your reasoning,
> if performance is the key value of the DPDK, then you should remove all driver
> support save for the most performant hardware you have.  By that same token,
> you should deprecate the pcap driver in favor of this AF_PACKET driver, because
> it has shown performance improvement.
> 
> I'm being facetious, of course, but the facts remain: Lack of superior
> performance from one PMD to the next does not immediately obviate the need for
> one PMD over another, as they quite likely address differing needs.  As you note,
> the DPDK seeks performance as a key goal, but it's an open source project, and there
> are other needs from other users in play here.  The AF_PACKET PMD provides
> superior performance on Linux platforms when hardware independence is required.
> It differs from the pcap PMD as it uses features that are only available on the
> Linux platform, so it stands to reason we should have both.

IMHO, the biggest deficiency in DPDK is the lack of apps.  Let's face
it, no one really cares about running l2fwd except for testing the
drivers.  What people want is applications.  Providing a PMD to use
while developing an app without requiring specific hardware seems like
a win to me.  The pcap PMD addresses some of that, but it is more of
a stop-gap or special purpose thing (like for playing back captures).

> > > > > It seems like the changes you mention will still need some sort of
> > > > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > > > separate from the code I already posted?  Or did you add that work
> > > > > on top of mine?
> > > > >
> > > >
> > > > > For userland code, it certainly uses some of your code related to raw sockets, but highly modified. A layer will be added to the eth_dev
> > > > > library to do device probe and support new socket options.
> > > >
> > > 
> > > Ok, but again, PMD's are independent, and serve different needs.  If they're use
> > > is at all overlapping from a functional standpoint, take this one now, and
> > > deprecate it when a better one comes along.  Though from your description it
> > > seems like both have a valid place in the ecosystem.
> > > 
> > 
> > I am ok with this approach, as long as this AF_PACKET PMD does not add extra maintain efforts. Thomas might make the call.
> > 
> What extra maintainer efforts do you think are required here, that wouldn't be
> required for any PMD?  To suggest that a given PMD shouldn't be included because
> it would require additional effort to maintain holds it to a higher standard
> than the PMD's already included.  I don't recall anyone asking if the i40e or
> bonding pmds would require additional effort before being integrated.

Right -- how much maintainer effort is put into the pcap driver
these days?

John
  
Zhou, Danny Sept. 15, 2014, 7:11 p.m. UTC | #18
> -----Original Message-----
> From: John W. Linville [mailto:linville@tuxdriver.com]
> Sent: Tuesday, September 16, 2014 1:48 AM
> To: Neil Horman
> Cc: Zhou, Danny; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> 
> On Mon, Sep 15, 2014 at 12:22:44PM -0400, Neil Horman wrote:
> > On Mon, Sep 15, 2014 at 03:43:07PM +0000, Zhou, Danny wrote:
> > >
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Monday, September 15, 2014 11:10 PM
> > > > To: Zhou, Danny
> > > > Cc: John W. Linville; dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > >
> > > > On Fri, Sep 12, 2014 at 08:35:47PM +0000, Zhou, Danny wrote:
> > > > > > -----Original Message-----
> > > > > > From: John W. Linville [mailto:linville@tuxdriver.com]
> > > > > > Sent: Saturday, September 13, 2014 2:54 AM
> > > > > > To: Zhou, Danny
> > > > > > Cc: dev@dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH v2] librte_pmd_packet: add PMD for AF_PACKET-based virtual devices
> > > > > >
> > > > > > On Fri, Sep 12, 2014 at 06:31:08PM +0000, Zhou, Danny wrote:
> > > > > > > I am concerned about its performance caused by too many
> > > > > > > memcpy(). Specifically, on Rx side, kernel NIC driver needs to copy
> > > > > > > packets to skb, then af_packet copies packets to AF_PACKET buffer
> > > > > > > which are mapped to user space, and then those packets to be copied
> > > > > > > to DPDK mbuf. In addition, 3 copies needed on Tx side. So to run a
> > > > > > > simple DPDK L2/L3 forwarding benchmark, each packet needs 6 packet
> > > > > > > copies which brings significant negative performance impact. We
> > > > > > > had a bifurcated driver prototype that can do zero-copy and achieve
> > > > > > > native DPDK performance, but it depends on base driver and AF_PACKET
> > > > > > > code changes in kernel, John R will be presenting it in coming Linux
> > > > > > > Plumbers Conference. Once kernel adopts it, the relevant PMD will be
> > > > > > > submitted to dpdk.org.
> > > > > >
> > > > > > Admittedly, this is not as good a performer as most of the existing
> > > > > > PMDs.  It serves a different purpose, afterall.  FWIW, you did
> > > > > > previously indicate that it performed better than the pcap-based PMD.
> > > > >
> > > > > Yes, slightly higher but makes no big difference.
> > > > >
> > > > Do you have numbers for this?  It seems to me faster is faster as long as it's
> > > > statistically significant.  Even if it's not, John's AF_PACKET pmd has the ability
> > > > to scale to multiple cpus more easily than the pcap pmd, as it can make use of
> > > > the AF_PACKET fanout feature.
> > >
> > > For 64B small packet, 1.35M pps with 1 queue.
> > > Why did you only test with a single queue?  Multiqueue operation was one of the
> > > big advantages of the AF_PACKET based pmd.  I would expect a single queue setup
> > > to perform in a very similar fashion to the pcap PMD.
> >
> > > > As both pcap and AF_PACKET PMDs depend on interrupt
> > > > based NIC kernel drivers, none of the DPDK performance optimization techniques are utilized. Why should DPDK adopt
> > > > two similar and poorly performing PMDs which cannot demonstrate DPDK's key value, "high performance"?
> > Several reasons:
> > > * "High performance" isn't always the key need for end users.  Consider
> > > the pre-hardware availability development phase.
> > >
> > > * Better hardware modeling (consider AF_PACKET's multiqueue ability)
> > >
> > > * Better scaling (pcap doesn't make use of the fanout features that AF_PACKET
> > > does)
> > >
> > > * Space savings: building the AF_PACKET pmd doesn't require the additional
> > > building/storage of the pcap driver.
> 
> This would include not requiring a dependency on libpcap, if nothing else.

librte_pmd_pcap and librte_pmd_packet are both DPDK wrapper libraries on top of the libpcap library and the AF_PACKET module respectively,
so they were not designed for high performance, which is understandable. DPDK is moving toward opening up to a larger audience of data center
consumers who do not care about very high performance, so from that angle, it makes sense to adopt librte_pmd_packet in my mind.

> 
> > >
> > > >
> > > > > > I look forward to seeing the changes you mention -- they sound very
> > > > > > exciting.  But, they will still require both networking core and
> > > > > > driver changes in the kernel.  And as I understand things today,
> > > > > > the userland code will still need at least some knowledge of specific
> > > > > > devices and how they layout their packet descriptors, etc.  So while
> > > > > > those changes sound very promising, they will still have certain
> > > > > > drawbacks in common with the current situation.
> > > > >
> > > > > Yes, we would like the DPDK performance optimization techniques, such as huge pages, efficient rx/tx routines to manipulate device-specific
> > > > > packet descriptors, and the polling model, to still be usable. We have to trade off between performance and commonality. But we believe it will be much easier
> > > > > to develop a DPDK PMD for non-Intel NICs than to port entire kernel drivers to DPDK.
> > > > >
> > > >
> > > > Not sure how this relates, what you're describing is the feature intel has been
> > > > working on to augment kernel drivers to provide better throughput via direct
> > > > hardware access to user space.  Johns PMD provides ubiquitous function on all
> > > > hardware. I'm not sure how the desire for one implies the other isn't valuable?
> > > >
> > >
> > > Performance is the key value of DPDK, instead of commonality. But we are trying to improve commonality of our solution to make it easily
> > > adopted by other NIC vendors.
> > >
> > Thats completely irrelevant to the question at hand.  To go with your reasoning,
> > if performance is the key value of the DPDK, then you should remove all driver
> > support save for the most performant hardware you have.  By that same token,
> > you should deprecate the pcap driver in favor of this AF_PACKET driver, because
> > it has shown performance improvement.
> >
> > I'm being facetious, of course, but the facts remain: Lack of superior
> > performance from one PMD to the next does not immediately obviate the need for
> > one PMD over another, as they quite likely address differing needs.  As you note
> > the DPDK seeks performance as a key goal, but its an open source project, there
> > are other needs from other users in play here.  The AF_PACKET pmd provides
> > superior performance on linux platforms when hardware independence is required.
> > It differs from the pcap PMD as it uses features that are only available on the
> > Linux platform, so it stands to reason we should have both.
> 
> IMHO, the biggest deficiency in DPDK is the lack of apps.  Let's face
> it, no one really cares about running l2fwd except for testing the
> drivers.  What people want is applications.  Providing a PMD to use
> while developing an app without requiring specific hardware seems like
> a win to me.  The pcap PMD addresses some of that, but it is more of
> a stop-gap or special purpose thing (like for playing back captures).
> 

It is not true for network middle boxes, which solve L2/L3 packet processing problems (the main problem DPDK was created to solve),
but it might be true for data center or endpoint applications that primarily focus on L4-L7 packet processing, which
do not care much about L2/L3 throughput and packet latency, as the system performance bottlenecks are in the L4-L7 routines.

> > > > > > It seems like the changes you mention will still need some sort of
> > > > > > AF_PACKET-based PMD driver.  Have you implemented that completely
> > > > > > separate from the code I already posted?  Or did you add that work
> > > > > > on top of mine?
> > > > > >
> > > > >
> > > > > For userland code, it certainly uses some of your code related to the raw socket, but highly modified. A layer will be added into the eth_dev
> > > > > library to do device probe and support new socket options.
> > > > >
> > > >
> > > > Ok, but again, PMD's are independent, and serve different needs.  If they're use
> > > > is at all overlapping from a functional standpoint, take this one now, and
> > > > deprecate it when a better one comes along.  Though from your description it
> > > > seems like both have a valid place in the ecosystem.
> > > >
> > >
> > > I am ok with this approach, as long as this AF_PACKET PMD does not add extra maintain efforts. Thomas might make the call.
> > >
> > What extra maintainer efforts do you think are required here, that wouldn't be
> > required for any PMD?  To suggest that a given PMD shouldn't be included because
> > it would require additional effort to maintain holds it to a higher standard
> > than the PMD's already included.  I don't recall anyone asking if the i40e or
> > bonding pmds would require additional effort before being integrated.
> 
> Right -- how much maintainer effort is put into the pcap driver
> these days?

I do not know the details, but I DO know the validation guys need to put a lot of effort into measuring its performance on different platforms.
Probably an automation function and a performance test suite can help a lot.

> 
> John
> --
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.
  
Neil Horman Sept. 16, 2014, 8:16 p.m. UTC | #19
On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> Ping?  Are there objections to this patch from mid-July?
> 
> John
> 
Thomas, where are you on this?  It seems like if you don't have any objections
to this patch, it should go in, in light of the lack of further commentary.

Neil
  
Thomas Monjalon Sept. 26, 2014, 9:28 a.m. UTC | #20
2014-09-16 16:16, Neil Horman:
> On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> > Ping?  Are there objections to this patch from mid-July?
> 
> Thomas, Where are you on this?  It seems like if you don't have any objections
> to this patch, it should go in, in ilght of the lack of further commentary.

1) It doesn't appear as a top priority.
2) It's competing with pcap PMD and bifurcated PMD to come
   (http://dpdk.org/ml/archives/dev/2014-September/005379.html)
3) There is no test associated with this PMD.
If one of these items becomes wrong, it should go in.

Currently, 2 projects are being initiated for validation (dcts) and
documentation. Keeping new things outside of the DPDK core makes it
clear that they do not have to be supported by dcts and doc yet.
So, it is better to have an external PMD, like memnic, acting as a
staging area.

During this time, keeping this PMD separately will allow you to update it
with a maintainer account in dpdk.org. I just need your SSH public key.

Thank you
  
Neil Horman Sept. 26, 2014, 2:08 p.m. UTC | #21
On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> 2014-09-16 16:16, Neil Horman:
> > On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> > > Ping?  Are there objections to this patch from mid-July?
> > 
> > Thomas, Where are you on this?  It seems like if you don't have any objections
> > to this patch, it should go in, in ilght of the lack of further commentary.
> 
> 1) It doesn't appear as a top priority.
That's your responsibility.  Patches can't languish and rot on a list forever
just because others aren't willing to test them.  If there's further testing that
you feel it needs, ask. But from my read, it's been tested for functionality and
performance (though high performance is never expected from an AF_PACKET PMD).
Given that any one PMD will not affect the performance of another in isolation,
I'm not sure what more you're waiting for here.

> 2) It's competing with pcap PMD and bifurcated PMD to come
>    (http://dpdk.org/ml/archives/dev/2014-September/005379.html)
Regarding the pcap PMD, so?  It's an alternate implementation that provides
different features with different limitations.  The fact that they are similar
is irrelevant.  If similarity was the test, then we wouldn't bother with the
bifurcated driver either, because the pcap pmd already exists.

Regarding the bifurcated driver, you can't hold existing patches on the promise
of another pmd that's coming at an indeterminate time in the future.  There's no
reason not to take this now and deprecate it in the future if there is
sufficient overlap with the bifurcated driver, though to my point above, they
still address different needs with different limitations, so I don't see doing
so as necessary.
 
> 3) There is no test associated with this PMD.
That would have been a great comment to make a few months back, though what's
wrong with testpmd here?  That seems to be the same test that every other pmd
uses. What exactly are you looking for?


> If one of this item becomes wrong, it should go in.
> 

> Currently, 2 projects are being initiated for validation (dcts) and
> documentation. Keeping new things outside of the DPDK core makes it
> clear that they have not to be supported by dcts and doc yet.
> So, it is better to have an external PMD, like memnic, acting as a
> staging area.
> 
So, this brings up an excellent point - Validation and support.  Commonly open
source projects don't provide support at the upstream HEAD. Those items are
applied and enforced by distributors.  There's no need to ensure that the
upstream head is always the most performant and stable point of the tree.  It's
that need that keeps the development pace slow, and creates frustrations like
this one, where a patch sits unaddressed for long periods of time.  Commonly the
workflow for most open source projects is for there to be a window of time where
visual review and basic functional testing are sufficient for acceptance into
the head of the tree.  After the development window closes there is a
stabilization period where testing/validation is done to ensure that no
regressions have been encountered, optionally with a -next branch temporarily
being created to accept patches for upcoming future releases.  If regressions
are found, it's a simple matter in git to bisect back to the offending patch,
allow the contributing developer an opportunity to fix the issue, or to drop the
patch.  Using a workflow like this we can have a reasonable balance of needs
(good patch turn around time, as well as reasonable testing).  We discussed
this when I posted the PMD_REGISTER_DRIVER patch months ago, and I thought you
were going to move in the direction of this workflow.  What happened?

> During this time, keeping this PMD separately will allow you to update it
> with a maintainer account in dpdk.org. I just need your SSH public key.
> 
We've discussed this too: keeping PMDs maintained separately is a very bad idea.
Doing so means developers have to constantly be aware of changes to the core
tree and try to keep up individually.  Integrating them all means that API
changes can be easily propagated to all PMDs when needed without making work
for many people.  It's exactly the reason we encourage driver writers to open
source drivers in Linux, because not doing so closes developers off from the
free maintenance they get when optimizations are made to APIs.  And if you
follow the development model above, you don't need to worry about implied
support, as that correctly becomes a distributor issue.


Neil
  
Bruce Richardson Sept. 29, 2014, 10:05 a.m. UTC | #22
On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > 2014-09-16 16:16, Neil Horman:
> > > On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> > > > Ping?  Are there objections to this patch from mid-July?
> > > 
> > > Thomas, Where are you on this?  It seems like if you don't have any objections
> > > to this patch, it should go in, in ilght of the lack of further commentary.
> > 
> > 1) It doesn't appear as a top priority.
> Thats your responsibility.  Patches can't languish and rot on a list forever
> just because others aren't willing to test it.  If theres further testing that
> you feel it needs, ask. But from my read, its been tested for functionality and
> performance (though high performance is never expected from a AF_PACKET PMD).
> Given that any one PMD will not affect the performance of another in isolation,
> I'm not sure what more you're waiting for here.
> 
> > 2) It's competing with pcap PMD and bifurcated PMD to come
> >    (http://dpdk.org/ml/archives/dev/2014-September/005379.html)
> Regarding the pcap PMD, so?  Its an alternate implementation that provides
> different features with different limitations.  The fact that they are simmilar
> is irrelevant.  If simmilarity was the test, then we wouldn't bother with the
> bifurcated driver either, because the pcap pmd already exists.
> 
> Regarding the bifurcated driver, you can't hold existing patches on the promise
> of another pmd thats comming at an indeterminate time in the future.  Theres no
> reason not to take this now and deprecate it in the future if there is
> sufficient overlap with the bifurcated driver, though to my point above, they
> still address different needs with different limitations, so I don't see doing
> so as necessecary.
>  
> > 3) There is no test associated with this PMD.
> That would have been a great comment to make a few months back, though whats
> wrong with testpmd here?  That seems to be the same test that every other pmd
> uses. What exactly are you looking for?
> 
> 
> > If one of this item becomes wrong, it should go in.
> > 
> 
> > Currently, 2 projects are being initiated for validation (dcts) and
> > documentation. Keeping new things outside of the DPDK core makes it
> > clear that they have not to be supported by dcts and doc yet.
> > So, it is better to have an external PMD, like memnic, acting as a
> > staging area.
> > 
> So, this brings up an excellent point - Validation and support.  Commonly open
> source projects don't provide support at the upstream HEAD. Those items are
> applied and inforced by distributors.  Theres no need to ensure that the
> upstream head is always the most performance and stable point of the tree.  Its
> that need that keeps the development pace slow, and creates frustrations like
> this one, where a patch sits unaddressed for long periods of time.  Commonly the
> workflow for most open source projects is for there to be a window of time where
> visual review and basic functional testing are sufficient for acceptance into
> the head of the tree.  After the development window closes there is a
> stabilization period where testing/validation is done to ensure that no
> regressions have been encountered, optionally with a -next branch temporarily
> being created to accept patches for upcomming future releases.  If regressions
> are found, its a simple matter in git to bisect back to the offending patch,
> allow the contributing developer an opportunity to fix the issue, or to drop the
> patch.  Using a workflow like this we can have a reasonable balance of needs
> (good patch turn around time, as well as reasonable testing).  We've discussed
> this when I posted the PMD_REGISTER_DRIVER patch months ago, and I thought you
> were going to move in the direction of this workflow.  What happened?
> 
> > During this time, keeping this PMD separately will allow you to update it
> > with a maintainer account in dpdk.org. I just need your SSH public key.
> > 
> We've discussed this too, keeping PMDs maintained separately is a very bad idea.
> Doing so means developers have to constantly be aware of changes to the core
> tree and try to keep up individually.  Integrating them all means that API
> changes can be easily propogated to all PMD's when needed without making work
> for many people.  Its exactly the reason we encourage driver writers to open
> source drivers in Linux, because not doing so closes developers off from the
> free maintenence they get when optimizations are made to API's.  And if you
> follow the development model above, you don't need to worry about implied
> support, as that correctly becomes a distributor issue.
> 
> 
> Neil

While not wanting to get too involved in the discussion, I'd just like to 
express my support for getting this new PMD merged in.

/Bruce
  
Thomas Monjalon Oct. 8, 2014, 3:57 p.m. UTC | #23
2014-09-29 11:05, Bruce Richardson:
> On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> > On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > > 2014-09-16 16:16, Neil Horman:
> > > > On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> > > > > Ping?  Are there objections to this patch from mid-July?
> > > > 
> > > > Thomas, Where are you on this?  It seems like if you don't have any objections
> > > > to this patch, it should go in, in ilght of the lack of further commentary.
> > > 
> > > 1) It doesn't appear as a top priority.
> > Thats your responsibility.  Patches can't languish and rot on a list forever
> > just because others aren't willing to test it.  If theres further testing that
> > you feel it needs, ask. But from my read, its been tested for functionality and
> > performance (though high performance is never expected from a AF_PACKET PMD).
> > Given that any one PMD will not affect the performance of another in isolation,
> > I'm not sure what more you're waiting for here.

Yes, integration of new PMDs must be accelerated.

> > > 2) It's competing with pcap PMD and bifurcated PMD to come
> > >    (http://dpdk.org/ml/archives/dev/2014-September/005379.html)
> > Regarding the pcap PMD, so?  Its an alternate implementation that provides
> > different features with different limitations.  The fact that they are simmilar
> > is irrelevant.  If simmilarity was the test, then we wouldn't bother with the
> > bifurcated driver either, because the pcap pmd already exists.
> > 
> > Regarding the bifurcated driver, you can't hold existing patches on the promise
> > of another pmd thats comming at an indeterminate time in the future.  Theres no
> > reason not to take this now and deprecate it in the future if there is
> > sufficient overlap with the bifurcated driver, though to my point above, they
> > still address different needs with different limitations, so I don't see doing
> > so as necessecary.

Yes, we'll discuss it when the bifurcated driver is released.

> > > 3) There is no test associated with this PMD.
> > That would have been a great comment to make a few months back, though whats
> > wrong with testpmd here?  That seems to be the same test that every other pmd
> > uses. What exactly are you looking for?

I was thinking of testing behaviour with different kernel configurations and
unit tests for --vdev options. But it's not a major blocker.

> > > If one of this item becomes wrong, it should go in.
> > 
> > > Currently, 2 projects are being initiated for validation (dcts) and
> > > documentation. Keeping new things outside of the DPDK core makes it
> > > clear that they have not to be supported by dcts and doc yet.
> > > So, it is better to have an external PMD, like memnic, acting as a
> > > staging area.
> > > 
> > So, this brings up an excellent point - Validation and support.  Commonly open
> > source projects don't provide support at the upstream HEAD. Those items are
> > applied and inforced by distributors.  Theres no need to ensure that the
> > upstream head is always the most performance and stable point of the tree.  Its
> > that need that keeps the development pace slow, and creates frustrations like
> > this one, where a patch sits unaddressed for long periods of time.  Commonly the
> > workflow for most open source projects is for there to be a window of time where
> > visual review and basic functional testing are sufficient for acceptance into
> > the head of the tree.  After the development window closes there is a
> > stabilization period where testing/validation is done to ensure that no
> > regressions have been encountered, optionally with a -next branch temporarily
> > being created to accept patches for upcomming future releases.  If regressions
> > are found, its a simple matter in git to bisect back to the offending patch,
> > allow the contributing developer an opportunity to fix the issue, or to drop the
> > patch.  Using a workflow like this we can have a reasonable balance of needs
> > (good patch turn around time, as well as reasonable testing).  We've discussed
> > this when I posted the PMD_REGISTER_DRIVER patch months ago, and I thought you
> > were going to move in the direction of this workflow.  What happened?

Yes, we are moving to a "merge window" workflow.

> > > During this time, keeping this PMD separately will allow you to update it
> > > with a maintainer account in dpdk.org. I just need your SSH public key.
> > > 
> > We've discussed this too, keeping PMDs maintained separately is a very bad idea.
> > Doing so means developers have to constantly be aware of changes to the core
> > tree and try to keep up individually.  Integrating them all means that API
> > changes can be easily propogated to all PMD's when needed without making work
> > for many people.  Its exactly the reason we encourage driver writers to open
> > source drivers in Linux, because not doing so closes developers off from the
> > free maintenence they get when optimizations are made to API's.  And if you
> > follow the development model above, you don't need to worry about implied
> > support, as that correctly becomes a distributor issue.
> > 
> > 
> > Neil
> 
> While not wanting to get too involved in the discussion, I'd just like to 
> express my support for getting this new PMD merged in.

If Red Hat is committed to its maintenance, it could be integrated in release 1.8.
But I'd like it to be renamed as pmd_af_packet (or a better name) instead of
pmd_packet.

Thanks
  
Neil Horman Oct. 8, 2014, 7:14 p.m. UTC | #24
On Wed, Oct 08, 2014 at 05:57:46PM +0200, Thomas Monjalon wrote:
> 2014-09-29 11:05, Bruce Richardson:
> > On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> > > On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > > > 2014-09-16 16:16, Neil Horman:
> > > > > On Fri, Sep 12, 2014 at 02:05:23PM -0400, John W. Linville wrote:
> > > > > > Ping?  Are there objections to this patch from mid-July?
> > > > > 
> > > > > Thomas, Where are you on this?  It seems like if you don't have any objections
> > > > > to this patch, it should go in, in ilght of the lack of further commentary.
> > > > 
> > > > 1) It doesn't appear as a top priority.
> > > Thats your responsibility.  Patches can't languish and rot on a list forever
> > > just because others aren't willing to test it.  If theres further testing that
> > > you feel it needs, ask. But from my read, its been tested for functionality and
> > > performance (though high performance is never expected from a AF_PACKET PMD).
> > > Given that any one PMD will not affect the performance of another in isolation,
> > > I'm not sure what more you're waiting for here.
> 
> Yes, integration of new PMD must be accelerated.
> 
> > > > 2) It's competing with pcap PMD and bifurcated PMD to come
> > > >    (http://dpdk.org/ml/archives/dev/2014-September/005379.html)
> > > Regarding the pcap PMD, so?  Its an alternate implementation that provides
> > > different features with different limitations.  The fact that they are simmilar
> > > is irrelevant.  If simmilarity was the test, then we wouldn't bother with the
> > > bifurcated driver either, because the pcap pmd already exists.
> > > 
> > > Regarding the bifurcated driver, you can't hold existing patches on the promise
> > > of another pmd thats comming at an indeterminate time in the future.  Theres no
> > > reason not to take this now and deprecate it in the future if there is
> > > sufficient overlap with the bifurcated driver, though to my point above, they
> > > still address different needs with different limitations, so I don't see doing
> > > so as necessecary.
> 
> Yes, we'll discuss it when bifurcated driver will be released.
> 
John Fastabend posted it to netdev just a few days ago. There have been some
concerns raised, which he is trying to address.  I'm watching how that goes.

> > > > 3) There is no test associated with this PMD.
> > > That would have been a great comment to make a few months back, though whats
> > > wrong with testpmd here?  That seems to be the same test that every other pmd
> > > uses. What exactly are you looking for?
> 
> I was thinking of testing behaviour with different kernel configurations and
> unit tests for --vdev options. But it's not a major blocker.
> 
That's fine with me.  If there's a set of unit tests that you have documentation
for, I'm sure we would be happy to run them.  I presume you just want all the
pmd vdev options exercised?  Any specific sets of kernel configurations?


> > > > If one of this item becomes wrong, it should go in.
> > > 
> > > > Currently, 2 projects are being initiated for validation (dcts) and
> > > > documentation. Keeping new things outside of the DPDK core makes it
> > > > clear that they have not to be supported by dcts and doc yet.
> > > > So, it is better to have an external PMD, like memnic, acting as a
> > > > staging area.
> > > > 
> > > So, this brings up an excellent point - Validation and support.  Commonly open
> > > source projects don't provide support at the upstream HEAD. Those items are
> > > applied and inforced by distributors.  Theres no need to ensure that the
> > > upstream head is always the most performance and stable point of the tree.  Its
> > > that need that keeps the development pace slow, and creates frustrations like
> > > this one, where a patch sits unaddressed for long periods of time.  Commonly the
> > > workflow for most open source projects is for there to be a window of time where
> > > visual review and basic functional testing are sufficient for acceptance into
> > > the head of the tree.  After the development window closes there is a
> > > stabilization period where testing/validation is done to ensure that no
> > > regressions have been encountered, optionally with a -next branch temporarily
> > > being created to accept patches for upcomming future releases.  If regressions
> > > are found, its a simple matter in git to bisect back to the offending patch,
> > > allow the contributing developer an opportunity to fix the issue, or to drop the
> > > patch.  Using a workflow like this we can have a reasonable balance of needs
> > > (good patch turn around time, as well as reasonable testing).  We've discussed
> > > this when I posted the PMD_REGISTER_DRIVER patch months ago, and I thought you
> > > were going to move in the direction of this workflow.  What happened?
> 
> Yes, we are moving to a "merge window" workflow.
> 
That would be wonderful.  I think separating the integration workflow from the
test workflow is critical here to making sure that patch integration isn't
unnecessarily delayed.

> > > > During this time, keeping this PMD separately will allow you to update it
> > > > with a maintainer account in dpdk.org. I just need your SSH public key.
> > > > 
> > > We've discussed this too, keeping PMDs maintained separately is a very bad idea.
> > > Doing so means developers have to constantly be aware of changes to the core
> > > tree and try to keep up individually.  Integrating them all means that API
> > > changes can be easily propagated to all PMD's when needed without making work
> > > for many people.  It's exactly the reason we encourage driver writers to open
> > > source drivers in Linux, because not doing so closes developers off from the
> > > free maintenance they get when optimizations are made to API's.  And if you
> > > follow the development model above, you don't need to worry about implied
> > > support, as that correctly becomes a distributor issue.
> > > 
> > > 
> > > Neil
> > 
> > While not wanting to get too involved in the discussion, I'd just like to 
> > express my support for getting this new PMD merged in.
> 
> If RedHat is committed for its maintenance, it could be integrated in release 1.8.
> But I'd like it to be renamed as pmd_af_packet (or a better name) instead of
> pmd_packet.
> 
John L. is on his way to plumbers at the moment, so is unable to comment, but
I'll try to get a few cycles to change the name of the PMD around.  And yes, I
thought that maintenance was implicit.  He's the author, of course he'll take
care of it :).  And I'll be glad to help

Neil

> Thanks
> -- 
> Thomas
>
  
Thomas Monjalon Nov. 13, 2014, 10:03 a.m. UTC | #25
Hi Neil and John,

I would like to wake up this very old thread.

2014-10-08 15:14, Neil Horman:
> On Wed, Oct 08, 2014 at 05:57:46PM +0200, Thomas Monjalon wrote:
> > 2014-09-29 11:05, Bruce Richardson:
> > > On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> > > > On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > > > > 3) There is no test associated with this PMD.
> > > > That would have been a great comment to make a few months back, though what's
> > > > wrong with testpmd here?  That seems to be the same test that every other pmd
> > > > uses. What exactly are you looking for?
> > 
> > I was thinking of testing behaviour with different kernel configurations and
> > unit tests for --vdev options. But it's not a major blocker.
> > 
> That's fine with me.  If there's a set of unit tests that you have documentation
> for, I'm sure we would be happy to run them.  I presume you just want all the
> pmd vdev options exercised?  Any specific sets of kernel configurations?

I don't really know which tests are needed. It could be a mix of unit tests
and functional tests described in a test plan.
The goal is to be able to validate the behaviour and check there is no
regression. Ideally some corner cases could be described.
I'm OK to integrate it as is. But future maintenance will probably need
such inputs for validation tests.

> > If RedHat is committed for its maintenance, it could be integrated in release 1.8.
> > But I'd like it to be renamed as pmd_af_packet (or a better name) instead of
> > pmd_packet.
> > 
> John L. is on his way to plumbers at the moment, so is unable to comment, but
> I'll try to get a few cycles to change the name of the PMD around.  And yes, I
> thought that maintenance was implicit.  He's the author, of course he'll take
> care of it :).  And I'll be glad to help

Do you have time in coming days to rebase and rename this PMD for inclusion
in 1.8.0 release?

Thanks
  
Neil Horman Nov. 13, 2014, 11:14 a.m. UTC | #26
On Thu, Nov 13, 2014 at 02:03:18AM -0800, Thomas Monjalon wrote:
> Hi Neil and John,
> 
> I would like to wake up this very old thread.
> 
> 2014-10-08 15:14, Neil Horman:
> > On Wed, Oct 08, 2014 at 05:57:46PM +0200, Thomas Monjalon wrote:
> > > 2014-09-29 11:05, Bruce Richardson:
> > > > On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> > > > > On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > > > > > 3) There is no test associated with this PMD.
> > > > > That would have been a great comment to make a few months back, though what's
> > > > > wrong with testpmd here?  That seems to be the same test that every other pmd
> > > > > uses. What exactly are you looking for?
> > > 
> > > I was thinking of testing behaviour with different kernel configurations and
> > > unit tests for --vdev options. But it's not a major blocker.
> > > 
> > That's fine with me.  If there's a set of unit tests that you have documentation
> > for, I'm sure we would be happy to run them.  I presume you just want all the
> > pmd vdev options exercised?  Any specific sets of kernel configurations?
> 
> I don't really know which tests are needed. It could be a mix of unit tests
> and functional tests described in a test plan.
> The goal is to be able to validate the behaviour and check there is no
> regression. Ideally some corner cases could be described.
> I'm OK to integrate it as is. But future maintenance will probably need
> such inputs for validation tests.
> 
Do you have an example set of tests that the other pmd's have followed for this?

> > > If RedHat is committed for its maintenance, it could be integrated in release 1.8.
> > > But I'd like it to be renamed as pmd_af_packet (or a better name) instead of
> > > pmd_packet.
> > > 
> > John L. is on his way to plumbers at the moment, so is unable to comment, but
> > I'll try to get a few cycles to change the name of the PMD around.  And yes, I
> > thought that maintenance was implicit.  He's the author, of course he'll take
> > care of it :).  And I'll be glad to help
> 
> Do you have time in coming days to rebase and rename this PMD for inclusion
> in 1.8.0 release?
> 
> Thanks
> -- 
> Thomas
>
  
Thomas Monjalon Nov. 13, 2014, 11:57 a.m. UTC | #27
2014-11-13 06:14, Neil Horman:
> On Thu, Nov 13, 2014 at 02:03:18AM -0800, Thomas Monjalon wrote:
> > 2014-10-08 15:14, Neil Horman:
> > > On Wed, Oct 08, 2014 at 05:57:46PM +0200, Thomas Monjalon wrote:
> > > > 2014-09-29 11:05, Bruce Richardson:
> > > > > On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> > > > > > On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > > > > > > 3) There is no test associated with this PMD.
> > > > > > That would have been a great comment to make a few months back, though what's
> > > > > > wrong with testpmd here?  That seems to be the same test that every other pmd
> > > > > > uses. What exactly are you looking for?
> > > > 
> > > > I was thinking of testing behaviour with different kernel configurations and
> > > > unit tests for --vdev options. But it's not a major blocker.
> > > > 
> > > That's fine with me.  If there's a set of unit tests that you have documentation
> > > for, I'm sure we would be happy to run them.  I presume you just want all the
> > > pmd vdev options exercised?  Any specific sets of kernel configurations?
> > 
> > I don't really know which tests are needed. It could be a mix of unit tests
> > and functional tests described in a test plan.
> > The goal is to be able to validate the behaviour and check there is no
> > regression. Ideally some corner cases could be described.
> > I'm OK to integrate it as is. But future maintenance will probably need
> > such inputs for validation tests.
> > 
> Do you have an example set of tests that the other pmd's have followed for this?

You can check this:
	http://dpdk.org/browse/tools/dts/tree/test_plans/pmd_test_plan.rst
	http://dpdk.org/browse/tools/dts/tree/test_plans/pmd_bonded_test_plan.rst

As I said, we can integrate AF_PACKET PMD without such test plan.
But we are going to improve testing of many areas in DPDK.

> > > > If RedHat is committed for its maintenance, it could be integrated in release 1.8.
> > > > But I'd like it to be renamed as pmd_af_packet (or a better name) instead of
> > > > pmd_packet.
> > > > 
> > > John L. is on his way to plumbers at the moment, so is unable to comment, but
> > > I'll try to get a few cycles to change the name of the PMD around.  And yes, I
> > > thought that maintenance was implicit.  He's the author, of course he'll take
> > > care of it :).  And I'll be glad to help
> > 
> > Do you have time in coming days to rebase and rename this PMD for inclusion
> > in 1.8.0 release?

Do you think a sub-tree with pull request model would help you for
maintenance of this PMD?
  
Neil Horman Nov. 14, 2014, 12:42 a.m. UTC | #28
On Thu, Nov 13, 2014 at 12:57:25PM +0100, Thomas Monjalon wrote:
> 2014-11-13 06:14, Neil Horman:
> > On Thu, Nov 13, 2014 at 02:03:18AM -0800, Thomas Monjalon wrote:
> > > 2014-10-08 15:14, Neil Horman:
> > > > On Wed, Oct 08, 2014 at 05:57:46PM +0200, Thomas Monjalon wrote:
> > > > > 2014-09-29 11:05, Bruce Richardson:
> > > > > > On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> > > > > > > On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > > > > > > > 3) There is no test associated with this PMD.
> > > > > > > That would have been a great comment to make a few months back, though what's
> > > > > > > wrong with testpmd here?  That seems to be the same test that every other pmd
> > > > > > > uses. What exactly are you looking for?
> > > > > 
> > > > > I was thinking of testing behaviour with different kernel configurations and
> > > > > unit tests for --vdev options. But it's not a major blocker.
> > > > > 
> > > > That's fine with me.  If there's a set of unit tests that you have documentation
> > > > for, I'm sure we would be happy to run them.  I presume you just want all the
> > > > pmd vdev options exercised?  Any specific sets of kernel configurations?
> > > 
> > > I don't really know which tests are needed. It could be a mix of unit tests
> > > and functional tests described in a test plan.
> > > The goal is to be able to validate the behaviour and check there is no
> > > regression. Ideally some corner cases could be described.
> > > I'm OK to integrate it as is. But future maintenance will probably need
> > > such inputs for validation tests.
> > > 
> > Do you have an example set of tests that the other pmd's have followed for this?
> 
> You can check this:
> 	http://dpdk.org/browse/tools/dts/tree/test_plans/pmd_test_plan.rst
> 	http://dpdk.org/browse/tools/dts/tree/test_plans/pmd_bonded_test_plan.rst
> 
> As I said, we can integrate AF_PACKET PMD without such test plan.
> But we are going to improve testing of many areas in DPDK.
> 
Thank you, I'll take a look in the AM

> > > > > If RedHat is committed for its maintenance, it could be integrated in release 1.8.
> > > > > But I'd like it to be renamed as pmd_af_packet (or a better name) instead of
> > > > > pmd_packet.
> > > > > 
> > > > John L. is on his way to plumbers at the moment, so is unable to comment, but
> > > > I'll try to get a few cycles to change the name of the PMD around.  And yes, I
> > > > thought that maintenance was implicit.  He's the author, of course he'll take
> > > > care of it :).  And I'll be glad to help
> > > 
> > > Do you have time in coming days to rebase and rename this PMD for inclusion
> > > in 1.8.0 release?
> 
> Do you think a sub-tree with pull request model would help you for
> maintenance of this PMD?
> 
I think that's a question for John to answer, but IMHO, I don't think the pmd
will have such patch volume that subtrees will be needed.

Neil

> -- 
> Thomas
>
  
John W. Linville Nov. 14, 2014, 2:45 p.m. UTC | #29
On Thu, Nov 13, 2014 at 07:42:08PM -0500, Neil Horman wrote:
> On Thu, Nov 13, 2014 at 12:57:25PM +0100, Thomas Monjalon wrote:
> > 2014-11-13 06:14, Neil Horman:
> > > On Thu, Nov 13, 2014 at 02:03:18AM -0800, Thomas Monjalon wrote:
> > > > 2014-10-08 15:14, Neil Horman:
> > > > > On Wed, Oct 08, 2014 at 05:57:46PM +0200, Thomas Monjalon wrote:

<snip>

> > > > > > If RedHat is committed for its maintenance, it could be integrated in release 1.8.
> > > > > > But I'd like it to be renamed as pmd_af_packet (or a better name) instead of
> > > > > > pmd_packet.
> > > > > > 
> > > > > John L. is on his way to plumbers at the moment, so is unable to comment, but
> > > > > I'll try to get a few cycles to change the name of the PMD around.  And yes, I
> > > > > thought that maintenance was implicit.  He's the author, of course he'll take
> > > > > care of it :).  And I'll be glad to help
> > > > 
> > > > Do you have time in coming days to rebase and rename this PMD for inclusion
> > > > in 1.8.0 release?
> > 
> > Do you think a sub-tree with pull request model would help you for
> > maintenance of this PMD?
> > 
> I think that's a question for John to answer, but IMHO, I don't think the pmd
> will have such patch volume that subtrees will be needed.

I haven't touched DPDK in a while, and I'm a bit busy...

When would you need it for 1.8.0?

John
  
Neil Horman Nov. 17, 2014, 11:19 a.m. UTC | #30
On Thu, Nov 13, 2014 at 12:57:25PM +0100, Thomas Monjalon wrote:
> 2014-11-13 06:14, Neil Horman:
> > On Thu, Nov 13, 2014 at 02:03:18AM -0800, Thomas Monjalon wrote:
> > > 2014-10-08 15:14, Neil Horman:
> > > > On Wed, Oct 08, 2014 at 05:57:46PM +0200, Thomas Monjalon wrote:
> > > > > 2014-09-29 11:05, Bruce Richardson:
> > > > > > On Fri, Sep 26, 2014 at 10:08:55AM -0400, Neil Horman wrote:
> > > > > > > On Fri, Sep 26, 2014 at 11:28:05AM +0200, Thomas Monjalon wrote:
> > > > > > > > 3) There is no test associated with this PMD.
> > > > > > > That would have been a great comment to make a few months back, though what's
> > > > > > > wrong with testpmd here?  That seems to be the same test that every other pmd
> > > > > > > uses. What exactly are you looking for?
> > > > > 
> > > > > I was thinking of testing behaviour with different kernel configurations and
> > > > > unit tests for --vdev options. But it's not a major blocker.
> > > > > 
> > > > That's fine with me.  If there's a set of unit tests that you have documentation
> > > > for, I'm sure we would be happy to run them.  I presume you just want all the
> > > > pmd vdev options exercised?  Any specific sets of kernel configurations?
> > > 
> > > I don't really know which tests are needed. It could be a mix of unit tests
> > > and functional tests described in a test plan.
> > > The goal is to be able to validate the behaviour and check there is no
> > > regression. Ideally some corner cases could be described.
> > > I'm OK to integrate it as is. But future maintenance will probably need
> > > such inputs for validation tests.
> > > 
Apologies for the delay on this, it's been a busy time lately.

> > Do you have an example set of tests that the other pmd's have followed for this?
> 
> You can check this:
> 	http://dpdk.org/browse/tools/dts/tree/test_plans/pmd_test_plan.rst
> 	http://dpdk.org/browse/tools/dts/tree/test_plans/pmd_bonded_test_plan.rst
> 
Looking at this, the pmd_test_plan above seems perfectly applicable to John's
pmd.  Did you feel as though additional tests were needed for a virtual pmd
(aside from a note describing the additional --vdev parameter required for
virtual device setup)?

I'll have a renamed device pmd patch up later today.

Neil

> As I said, we can integrate AF_PACKET PMD without such test plan.
> But we are going to improve testing of many areas in DPDK.
> 
> > > > > If RedHat is committed for its maintenance, it could be integrated in release 1.8.
> > > > > But I'd like it to be renamed as pmd_af_packet (or a better name) instead of
> > > > > pmd_packet.
> > > > > 
> > > > John L. is on his way to plumbers at the moment, so is unable to comment, but
> > > > I'll try to get a few cycles to change the name of the PMD around.  And yes, I
> > > > thought that maintenance was implicit.  He's the author, of course he'll take
> > > > care of it :).  And I'll be glad to help
> > > 
> > > Do you have time in coming days to rebase and rename this PMD for inclusion
> > > in 1.8.0 release?
> 
> Do you think a sub-tree with pull request model would help you for
> maintenance of this PMD?
> 
> -- 
> Thomas
>
  
Thomas Monjalon Nov. 17, 2014, 11:22 a.m. UTC | #31
2014-11-17 06:19, Neil Horman:
> On Thu, Nov 13, 2014 at 12:57:25PM +0100, Thomas Monjalon wrote:
> > 2014-11-13 06:14, Neil Horman:
> > > Do you have an example set of tests that the other pmd's have followed for this?
> > 
> > You can check this:
> > 	http://dpdk.org/browse/tools/dts/tree/test_plans/pmd_test_plan.rst
> > 	http://dpdk.org/browse/tools/dts/tree/test_plans/pmd_bonded_test_plan.rst
> > 
> Looking at this, the pmd_test_plan above seems perfectly applicable to John's
> pmd.  Did you feel as though additional tests were needed for a virtual pmd
> (aside from a note describing the additional --vdev parameter required for
> virtual device setup)?

It's maybe sufficient. I didn't dig enough. We'll see whether some people need
more for validation.

> I'll have a renamed device pmd patch up later today.

Excellent.

Thanks
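
As an illustration of the "unit tests for --vdev options" raised in this
thread, here is a minimal sketch in C. It is not the driver's code:
parse_vdev_opts() and struct vdev_opts are hypothetical names, and the real
PMD hands its arguments to rte_kvargs. Only the option names (iface, qpairs,
blocksz, framesz, framecnt) and their defaults (1, 4096, 2048, 512) are taken
from the patch.

```c
#include <stdlib.h>
#include <string.h>

/* Defaults taken from the patch (DFLT_* macros and the commit message). */
#define DFLT_QPAIRS      1
#define DFLT_BLOCK_SIZE  (1 << 12)
#define DFLT_FRAME_SIZE  (1 << 11)
#define DFLT_FRAME_COUNT (1 << 9)

/* Hypothetical container for the parsed options; not part of the patch. */
struct vdev_opts {
	char iface[64];
	unsigned int qpairs;
	unsigned int blocksz;
	unsigned int framesz;
	unsigned int framecnt;
};

/*
 * Hypothetical stand-in for the driver's rte_kvargs handling: parse a
 * "key=value,key=value" string.  Returns 0 on success, or -1 on a
 * malformed pair, an unknown key, or a missing "iface".
 */
static int
parse_vdev_opts(const char *args, struct vdev_opts *opts)
{
	const char *p = args;

	/* start from the documented defaults */
	memset(opts, 0, sizeof(*opts));
	opts->qpairs = DFLT_QPAIRS;
	opts->blocksz = DFLT_BLOCK_SIZE;
	opts->framesz = DFLT_FRAME_SIZE;
	opts->framecnt = DFLT_FRAME_COUNT;

	while (*p != '\0') {
		char pair[128];
		char *val;
		const char *end = strchr(p, ',');
		size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);

		/* copy one "key=value" pair into a scratch buffer */
		if (len == 0 || len >= sizeof(pair))
			return -1;
		memcpy(pair, p, len);
		pair[len] = '\0';

		/* split the pair at '=' */
		val = strchr(pair, '=');
		if (val == NULL)
			return -1;
		*val++ = '\0';

		if (strcmp(pair, "iface") == 0) {
			if (strlen(val) >= sizeof(opts->iface))
				return -1;
			strcpy(opts->iface, val);
		} else if (strcmp(pair, "qpairs") == 0) {
			opts->qpairs = (unsigned int)strtoul(val, NULL, 10);
		} else if (strcmp(pair, "blocksz") == 0) {
			opts->blocksz = (unsigned int)strtoul(val, NULL, 10);
		} else if (strcmp(pair, "framesz") == 0) {
			opts->framesz = (unsigned int)strtoul(val, NULL, 10);
		} else if (strcmp(pair, "framecnt") == 0) {
			opts->framecnt = (unsigned int)strtoul(val, NULL, 10);
		} else {
			return -1;	/* unknown option */
		}

		/* step past this pair and its separator, if any */
		p += len;
		if (*p == ',')
			p++;
	}

	/* "iface" is the only required option */
	return (opts->iface[0] != '\0') ? 0 : -1;
}
```

A test along these lines would feed strings such as "iface=eth0" or
"iface=eth0,qpairs=4" to the parser and check both the explicit values and
the defaults, plus the rejection of a missing iface or an unknown key.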
  

Patch

diff --git a/config/common_bsdapp b/config/common_bsdapp
index 943dce8f1ede..c317f031278e 100644
--- a/config/common_bsdapp
+++ b/config/common_bsdapp
@@ -226,6 +226,11 @@  CONFIG_RTE_LIBRTE_PMD_PCAP=y
 CONFIG_RTE_LIBRTE_PMD_BOND=y
 
 #
+# Compile software PMD backed by AF_PACKET sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_PACKET=n
+
+#
 # Do prefetch of packet data within PMD driver receive function
 #
 CONFIG_RTE_PMD_PACKET_PREFETCH=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 7bf5d80d4e26..f9e7bc3015ec 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -249,6 +249,11 @@  CONFIG_RTE_LIBRTE_PMD_PCAP=n
 CONFIG_RTE_LIBRTE_PMD_BOND=y
 
 #
+# Compile software PMD backed by AF_PACKET sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_PACKET=y
+
+#
 # Compile Xen PMD
 #
 CONFIG_RTE_LIBRTE_PMD_XENVIRT=n
diff --git a/lib/Makefile b/lib/Makefile
index 10c5bb3045bc..930fadf29898 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -47,6 +47,7 @@  DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += librte_pmd_i40e
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_BOND) += librte_pmd_bond
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_RING) += librte_pmd_ring
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += librte_pmd_packet
 DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
 DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 756d6b0c9301..feed24a63272 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -44,6 +44,7 @@  CFLAGS += -I$(RTE_SDK)/lib/librte_ether
 CFLAGS += -I$(RTE_SDK)/lib/librte_ivshmem
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_ring
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_pcap
+CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_packet
 CFLAGS += -I$(RTE_SDK)/lib/librte_pmd_xenvirt
 CFLAGS += $(WERROR_FLAGS) -O3
 
diff --git a/lib/librte_pmd_packet/Makefile b/lib/librte_pmd_packet/Makefile
new file mode 100644
index 000000000000..e1266fb992cd
--- /dev/null
+++ b/lib/librte_pmd_packet/Makefile
@@ -0,0 +1,60 @@ 
+#   BSD LICENSE
+#
+#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
+#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   Copyright(c) 2014 6WIND S.A.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_pmd_packet.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+#
+# all sources are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += rte_eth_packet.c
+
+#
+# Export include files
+#
+SYMLINK-y-include += rte_eth_packet.h
+
+# this lib depends upon:
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_ether
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_malloc
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PACKET) += lib/librte_kvargs
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_pmd_packet/rte_eth_packet.c b/lib/librte_pmd_packet/rte_eth_packet.c
new file mode 100644
index 000000000000..9c82d16e730f
--- /dev/null
+++ b/lib/librte_pmd_packet/rte_eth_packet.c
@@ -0,0 +1,826 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
+ *
+ *   Originally based upon librte_pmd_pcap code:
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2014 6WIND S.A.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_mbuf.h>
+#include <rte_ethdev.h>
+#include <rte_malloc.h>
+#include <rte_kvargs.h>
+#include <rte_dev.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+
+#include "rte_eth_packet.h"
+
+#define ETH_PACKET_IFACE_ARG		"iface"
+#define ETH_PACKET_NUM_Q_ARG		"qpairs"
+#define ETH_PACKET_BLOCKSIZE_ARG	"blocksz"
+#define ETH_PACKET_FRAMESIZE_ARG	"framesz"
+#define ETH_PACKET_FRAMECOUNT_ARG	"framecnt"
+
+#define DFLT_BLOCK_SIZE		(1 << 12)
+#define DFLT_FRAME_SIZE		(1 << 11)
+#define DFLT_FRAME_COUNT	(1 << 9)
+
+struct pkt_rx_queue {
+	int sockfd;
+
+	struct iovec *rd;
+	uint8_t *map;
+	unsigned int framecount;
+	unsigned int framenum;
+
+	struct rte_mempool *mb_pool;
+
+	volatile unsigned long rx_pkts;
+	volatile unsigned long err_pkts;
+};
+
+struct pkt_tx_queue {
+	int sockfd;
+
+	struct iovec *rd;
+	uint8_t *map;
+	unsigned int framecount;
+	unsigned int framenum;
+
+	volatile unsigned long tx_pkts;
+	volatile unsigned long err_pkts;
+};
+
+struct pmd_internals {
+	unsigned nb_queues;
+
+	int if_index;
+	struct ether_addr eth_addr;
+
+	struct tpacket_req req;
+
+	struct pkt_rx_queue rx_queue[RTE_PMD_PACKET_MAX_RINGS];
+	struct pkt_tx_queue tx_queue[RTE_PMD_PACKET_MAX_RINGS];
+};
+
+static const char *valid_arguments[] = {
+	ETH_PACKET_IFACE_ARG,
+	ETH_PACKET_NUM_Q_ARG,
+	ETH_PACKET_BLOCKSIZE_ARG,
+	ETH_PACKET_FRAMESIZE_ARG,
+	ETH_PACKET_FRAMECOUNT_ARG,
+	NULL
+};
+
+static const char *drivername = "AF_PACKET PMD";
+
+static struct rte_eth_link pmd_link = {
+	.link_speed = 10000,
+	.link_duplex = ETH_LINK_FULL_DUPLEX,
+	.link_status = 0
+};
+
+static uint16_t
+eth_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	unsigned i;
+	struct tpacket2_hdr *ppd;
+	struct rte_mbuf *mbuf;
+	uint8_t *pbuf;
+	struct pkt_rx_queue *pkt_q = queue;
+	uint16_t num_rx = 0;
+	unsigned int framecount, framenum;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	/*
+	 * Reads the given number of packets from the AF_PACKET socket one by
+	 * one and copies the packet data into a newly allocated mbuf.
+	 */
+	framecount = pkt_q->framecount;
+	framenum = pkt_q->framenum;
+	for (i = 0; i < nb_pkts; i++) {
+		/* point at the next incoming frame */
+		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+		if ((ppd->tp_status & TP_STATUS_USER) == 0)
+			break;
+
+		/* allocate the next mbuf */
+		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
+		if (unlikely(mbuf == NULL))
+			break;
+
+		/* packet will fit in the mbuf, go ahead and receive it */
+		mbuf->pkt.pkt_len = mbuf->pkt.data_len = ppd->tp_snaplen;
+		pbuf = (uint8_t *) ppd + ppd->tp_mac;
+		memcpy(mbuf->pkt.data, pbuf, mbuf->pkt.data_len);
+
+		/* release incoming frame and advance ring buffer */
+		ppd->tp_status = TP_STATUS_KERNEL;
+		if (++framenum >= framecount)
+			framenum = 0;
+
+		/* account for the received frame */
+		bufs[i] = mbuf;
+		num_rx++;
+	}
+	pkt_q->framenum = framenum;
+	pkt_q->rx_pkts += num_rx;
+	return num_rx;
+}
+
+/*
+ * Callback to handle sending packets through a real NIC.
+ */
+static uint16_t
+eth_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tpacket2_hdr *ppd;
+	struct rte_mbuf *mbuf;
+	uint8_t *pbuf;
+	unsigned int framecount, framenum;
+	struct pollfd pfd;
+	struct pkt_tx_queue *pkt_q = queue;
+	uint16_t num_tx = 0;
+	int i;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	memset(&pfd, 0, sizeof(pfd));
+	pfd.fd = pkt_q->sockfd;
+	pfd.events = POLLOUT;
+	pfd.revents = 0;
+
+	framecount = pkt_q->framecount;
+	framenum = pkt_q->framenum;
+	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+	for (i = 0; i < nb_pkts; i++) {
+		/* wait for the next tx frame to become available */
+		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
+		    (poll(&pfd, 1, -1) < 0))
+			continue;
+
+		/* copy the tx frame data */
+		mbuf = bufs[num_tx];
+		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
+			sizeof(struct sockaddr_ll);
+		memcpy(pbuf, mbuf->pkt.data, mbuf->pkt.data_len);
+		ppd->tp_len = ppd->tp_snaplen = mbuf->pkt.data_len;
+
+		/* hand the frame to the kernel and advance ring buffer */
+		ppd->tp_status = TP_STATUS_SEND_REQUEST;
+		if (++framenum >= framecount)
+			framenum = 0;
+		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
+
+		num_tx++;
+		rte_pktmbuf_free(mbuf);
+	}
+
+	/* kick-off transmits */
+	sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+
+	pkt_q->framenum = framenum;
+	pkt_q->tx_pkts += num_tx;
+	pkt_q->err_pkts += nb_pkts - num_tx;
+	return num_tx;
+}
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	dev->data->dev_link.link_status = 1;
+	return 0;
+}
+
+/*
+ * This function gets called when the current port gets stopped.
+ */
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	unsigned i;
+	int sockfd;
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	for (i = 0; i < internals->nb_queues; i++) {
+		sockfd = internals->rx_queue[i].sockfd;
+		if (sockfd != -1)
+			close(sockfd);
+		sockfd = internals->tx_queue[i].sockfd;
+		if (sockfd != -1)
+			close(sockfd);
+	}
+
+	dev->data->dev_link.link_status = 0;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev_info->driver_name = drivername;
+	dev_info->if_index = internals->if_index;
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
+	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
+	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
+	dev_info->min_rx_bufsize = 0;
+	dev_info->pci_dev = NULL;
+}
+
+static void
+eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
+{
+	unsigned i, imax;
+	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
+	const struct pmd_internals *internal = dev->data->dev_private;
+
+	memset(igb_stats, 0, sizeof(*igb_stats));
+
+	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
+	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
+	for (i = 0; i < imax; i++) {
+		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
+		rx_total += igb_stats->q_ipackets[i];
+	}
+
+	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
+	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
+	for (i = 0; i < imax; i++) {
+		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
+		igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
+		tx_total += igb_stats->q_opackets[i];
+		tx_err_total += igb_stats->q_errors[i];
+	}
+
+	igb_stats->ipackets = rx_total;
+	igb_stats->opackets = tx_total;
+	igb_stats->oerrors = tx_err_total;
+}
+
+static void
+eth_stats_reset(struct rte_eth_dev *dev)
+{
+	unsigned i;
+	struct pmd_internals *internal = dev->data->dev_private;
+
+	for (i = 0; i < internal->nb_queues; i++)
+		internal->rx_queue[i].rx_pkts = 0;
+
+	for (i = 0; i < internal->nb_queues; i++) {
+		internal->tx_queue[i].tx_pkts = 0;
+		internal->tx_queue[i].err_pkts = 0;
+	}
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev __rte_unused)
+{
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+                int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev,
+                   uint16_t rx_queue_id,
+                   uint16_t nb_rx_desc __rte_unused,
+                   unsigned int socket_id __rte_unused,
+                   const struct rte_eth_rxconf *rx_conf __rte_unused,
+                   struct rte_mempool *mb_pool)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct pkt_rx_queue *pkt_q = &internals->rx_queue[rx_queue_id];
+	struct rte_pktmbuf_pool_private *mbp_priv;
+	uint16_t buf_size;
+
+	pkt_q->mb_pool = mb_pool;
+
+	/* Now get the space available for data in the mbuf */
+	mbp_priv = rte_mempool_get_priv(pkt_q->mb_pool);
+	buf_size = (uint16_t) (mbp_priv->mbuf_data_room_size -
+	                       RTE_PKTMBUF_HEADROOM);
+
+	if (ETH_FRAME_LEN > buf_size) {
+		RTE_LOG(ERR, PMD,
+			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
+			dev->data->name, ETH_FRAME_LEN, buf_size);
+		return -ENOMEM;
+	}
+
+	dev->data->rx_queues[rx_queue_id] = pkt_q;
+
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev,
+                   uint16_t tx_queue_id,
+                   uint16_t nb_tx_desc __rte_unused,
+                   unsigned int socket_id __rte_unused,
+                   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id];
+	return 0;
+}
+
+static struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+	.stats_get = eth_stats_get,
+	.stats_reset = eth_stats_reset,
+};
+
+/*
+ * Opens an AF_PACKET socket
+ */
+static int
+open_packet_iface(const char *key __rte_unused,
+                  const char *value __rte_unused,
+                  void *extra_args)
+{
+	int *sockfd = extra_args;
+
+	/* Open an AF_PACKET socket... */
+	*sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+	if (*sockfd == -1) {
+		RTE_LOG(ERR, PMD, "Could not open AF_PACKET socket\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+rte_pmd_init_internals(const char *name,
+                       const int sockfd,
+                       const unsigned nb_queues,
+                       unsigned int blocksize,
+                       unsigned int blockcnt,
+                       unsigned int framesize,
+                       unsigned int framecnt,
+                       const unsigned numa_node,
+                       struct pmd_internals **internals,
+                       struct rte_eth_dev **eth_dev,
+                       struct rte_kvargs *kvlist)
+{
+	struct rte_eth_dev_data *data = NULL;
+	struct rte_pci_device *pci_dev = NULL;
+	struct rte_kvargs_pair *pair = NULL;
+	struct ifreq ifr;
+	size_t ifnamelen;
+	unsigned k_idx;
+	struct sockaddr_ll sockaddr;
+	struct tpacket_req *req;
+	struct pkt_rx_queue *rx_queue;
+	struct pkt_tx_queue *tx_queue;
+	int rc, tpver, discard, bypass;
+	unsigned int i, q, rdsize;
+	int qsockfd, fanout_arg;
+
+	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+		pair = &kvlist->pairs[k_idx];
+		if (strstr(pair->key, ETH_PACKET_IFACE_ARG) != NULL)
+			break;
+	}
+	if (k_idx == kvlist->count) {
+		RTE_LOG(ERR, PMD,
+			"%s: no interface specified for AF_PACKET ethdev\n",
+		        name);
+		return -1;
+	}
+
+	RTE_LOG(INFO, PMD,
+		"%s: creating AF_PACKET-backed ethdev on numa socket %u\n",
+		name, numa_node);
+
+	/*
+	 * now do all data allocation - for eth_dev structure, dummy pci driver
+	 * and internal (private) data
+	 */
+	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
+	if (data == NULL)
+		goto error;
+
+	pci_dev = rte_zmalloc_socket(name, sizeof(*pci_dev), 0, numa_node);
+	if (pci_dev == NULL)
+		goto error;
+
+	*internals = rte_zmalloc_socket(name, sizeof(**internals),
+	                                0, numa_node);
+	if (*internals == NULL)
+		goto error;
+
+	req = &((*internals)->req);
+
+	req->tp_block_size = blocksize;
+	req->tp_block_nr = blockcnt;
+	req->tp_frame_size = framesize;
+	req->tp_frame_nr = framecnt;
+
+	ifnamelen = strlen(pair->value);
+	if (ifnamelen < sizeof(ifr.ifr_name)) {
+		memcpy(ifr.ifr_name, pair->value, ifnamelen);
+		ifr.ifr_name[ifnamelen] = '\0';
+	} else {
+		RTE_LOG(ERR, PMD,
+			"%s: I/F name too long (%s)\n",
+			name, pair->value);
+		goto error;
+	}
+	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
+		RTE_LOG(ERR, PMD,
+			"%s: ioctl failed (SIOCGIFINDEX)\n",
+		        name);
+		goto error;
+	}
+	(*internals)->if_index = ifr.ifr_ifindex;
+
+	if (ioctl(sockfd, SIOCGIFHWADDR, &ifr) == -1) {
+		RTE_LOG(ERR, PMD,
+			"%s: ioctl failed (SIOCGIFHWADDR)\n",
+		        name);
+		goto error;
+	}
+	memcpy(&(*internals)->eth_addr, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
+
+	memset(&sockaddr, 0, sizeof(sockaddr));
+	sockaddr.sll_family = AF_PACKET;
+	sockaddr.sll_protocol = htons(ETH_P_ALL);
+	sockaddr.sll_ifindex = (*internals)->if_index;
+
+	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
+	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG |
+	               PACKET_FANOUT_FLAG_ROLLOVER) << 16;
+
+	for (q = 0; q < nb_queues; q++) {
+		/* Open an AF_PACKET socket for this queue... */
+		qsockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+		if (qsockfd == -1) {
+			RTE_LOG(ERR, PMD,
+			        "%s: could not open AF_PACKET socket\n",
+			        name);
+			goto error;
+		}
+
+		tpver = TPACKET_V2;
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_VERSION,
+				&tpver, sizeof(tpver));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_VERSION on AF_PACKET "
+				"socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		discard = 1;
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_LOSS,
+				&discard, sizeof(discard));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_LOSS on "
+			        "AF_PACKET socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		bypass = 1;
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
+				&bypass, sizeof(bypass));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_QDISC_BYPASS "
+			        "on AF_PACKET socket for %s\n", name,
+			        pair->value);
+			goto error;
+		}
+
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_RX_RING on AF_PACKET "
+				"socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_TX_RING on AF_PACKET "
+				"socket for %s\n", name, pair->value);
+			goto error;
+		}
+
+		rx_queue = &((*internals)->rx_queue[q]);
+		rx_queue->framecount = req->tp_frame_nr;
+
+		rx_queue->map = mmap(NULL, 2 * req->tp_block_size * req->tp_block_nr,
+				    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
+				    qsockfd, 0);
+		if (rx_queue->map == MAP_FAILED) {
+			RTE_LOG(ERR, PMD,
+				"%s: call to mmap failed on AF_PACKET socket for %s\n",
+				name, pair->value);
+			goto error;
+		}
+
+		/* rdsize is same for both Tx and Rx */
+		rdsize = req->tp_frame_nr * sizeof(*(rx_queue->rd));
+
+		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
+		if (rx_queue->rd == NULL)
+			goto error;
+		for (i = 0; i < req->tp_frame_nr; ++i) {
+			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
+			rx_queue->rd[i].iov_len = req->tp_frame_size;
+		}
+		rx_queue->sockfd = qsockfd;
+
+		tx_queue = &((*internals)->tx_queue[q]);
+		tx_queue->framecount = req->tp_frame_nr;
+
+		tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr;
+
+		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
+		if (tx_queue->rd == NULL)
+			goto error;
+		for (i = 0; i < req->tp_frame_nr; ++i) {
+			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
+			tx_queue->rd[i].iov_len = req->tp_frame_size;
+		}
+		tx_queue->sockfd = qsockfd;
+
+		rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not bind AF_PACKET socket to %s\n",
+			        name, pair->value);
+			goto error;
+		}
+
+		rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
+				&fanout_arg, sizeof(fanout_arg));
+		if (rc == -1) {
+			RTE_LOG(ERR, PMD,
+				"%s: could not set PACKET_FANOUT on AF_PACKET socket "
+				"for %s\n", name, pair->value);
+			goto error;
+		}
+	}
+
+	/* reserve an ethdev entry */
+	*eth_dev = rte_eth_dev_allocate(name);
+	if (*eth_dev == NULL)
+		goto error;
+
+	/*
+	 * now put it all together
+	 * - store queue data in internals,
+	 * - store numa_node info in pci_dev
+	 * - point eth_dev_data to internals, and eth_dev to pci_dev
+	 * - and point eth_dev structure to new eth_dev_data structure
+	 */
+
+	(*internals)->nb_queues = nb_queues;
+
+	data->dev_private = *internals;
+	data->port_id = (*eth_dev)->data->port_id;
+	data->nb_rx_queues = (uint16_t)nb_queues;
+	data->nb_tx_queues = (uint16_t)nb_queues;
+	data->dev_link = pmd_link;
+	data->mac_addrs = &(*internals)->eth_addr;
+
+	pci_dev->numa_node = numa_node;
+
+	(*eth_dev)->data = data;
+	(*eth_dev)->dev_ops = &ops;
+	(*eth_dev)->pci_dev = pci_dev;
+
+	return 0;
+
+error:
+	if (data)
+		rte_free(data);
+	if (pci_dev)
+		rte_free(pci_dev);
+	if (*internals) {
+		for (q = 0; q < nb_queues; q++) {
+			rte_free((*internals)->rx_queue[q].rd);
+			rte_free((*internals)->tx_queue[q].rd);
+		}
+		rte_free(*internals);
+		*internals = NULL;
+	}
+	return -1;
+}
+
+static int
+rte_eth_from_packet(const char *name,
+                    int const *sockfd,
+                    const unsigned numa_node,
+                    struct rte_kvargs *kvlist)
+{
+	struct pmd_internals *internals = NULL;
+	struct rte_eth_dev *eth_dev = NULL;
+	struct rte_kvargs_pair *pair = NULL;
+	unsigned k_idx;
+	unsigned int blockcount;
+	unsigned int blocksize = DFLT_BLOCK_SIZE;
+	unsigned int framesize = DFLT_FRAME_SIZE;
+	unsigned int framecount = DFLT_FRAME_COUNT;
+	unsigned int qpairs = 1;
+
+	/* do some parameter checking */
+	if (*sockfd < 0)
+		return -1;
+
+	/*
+	 * Walk arguments for configurable settings
+	 */
+	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+		pair = &kvlist->pairs[k_idx];
+		if (strstr(pair->key, ETH_PACKET_NUM_Q_ARG) != NULL) {
+			qpairs = atoi(pair->value);
+			if (qpairs < 1 ||
+			    qpairs > RTE_PMD_PACKET_MAX_RINGS) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid qpairs value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_PACKET_BLOCKSIZE_ARG) != NULL) {
+			blocksize = atoi(pair->value);
+			if (!blocksize) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid blocksize value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_PACKET_FRAMESIZE_ARG) != NULL) {
+			framesize = atoi(pair->value);
+			if (!framesize) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid framesize value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_PACKET_FRAMECOUNT_ARG) != NULL) {
+			framecount = atoi(pair->value);
+			if (!framecount) {
+				RTE_LOG(ERR, PMD,
+					"%s: invalid framecount value\n",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+	}
+
+	if (framesize > blocksize) {
+		RTE_LOG(ERR, PMD,
+			"%s: AF_PACKET MMAP frame size exceeds block size!\n",
+		        name);
+		return -1;
+	}
+
+	blockcount = framecount / (blocksize / framesize);
+	if (!blockcount) {
+		RTE_LOG(ERR, PMD,
+			"%s: invalid AF_PACKET MMAP parameters\n", name);
+		return -1;
+	}
+
+	RTE_LOG(INFO, PMD, "%s: AF_PACKET MMAP parameters:\n", name);
+	RTE_LOG(INFO, PMD, "%s:\tblock size %u\n", name, blocksize);
+	RTE_LOG(INFO, PMD, "%s:\tblock count %u\n", name, blockcount);
+	RTE_LOG(INFO, PMD, "%s:\tframe size %u\n", name, framesize);
+	RTE_LOG(INFO, PMD, "%s:\tframe count %u\n", name, framecount);
+
+	if (rte_pmd_init_internals(name, *sockfd, qpairs,
+	                           blocksize, blockcount,
+	                           framesize, framecount,
+	                           numa_node, &internals, &eth_dev,
+	                           kvlist) < 0)
+		return -1;
+
+	eth_dev->rx_pkt_burst = eth_packet_rx;
+	eth_dev->tx_pkt_burst = eth_packet_tx;
+
+	return 0;
+}
+
+int
+rte_pmd_packet_devinit(const char *name, const char *params)
+{
+	unsigned numa_node;
+	int ret;
+	struct rte_kvargs *kvlist;
+	int sockfd = -1;
+
+	RTE_LOG(INFO, PMD, "Initializing pmd_packet for %s\n", name);
+
+	numa_node = rte_socket_id();
+
+	kvlist = rte_kvargs_parse(params, valid_arguments);
+	if (kvlist == NULL)
+		return -1;
+
+	/*
+	 * If iface argument is passed we open the NICs and use them for
+	 * reading / writing
+	 */
+	if (rte_kvargs_count(kvlist, ETH_PACKET_IFACE_ARG) == 1) {
+
+		ret = rte_kvargs_process(kvlist, ETH_PACKET_IFACE_ARG,
+		                         &open_packet_iface, &sockfd);
+		if (ret < 0) {
+			rte_kvargs_free(kvlist);
+			return -1;
+		}
+	}
+
+	ret = rte_eth_from_packet(name, &sockfd, numa_node, kvlist);
+	if (sockfd >= 0)
+		close(sockfd); /* no longer needed */
+	rte_kvargs_free(kvlist);
+
+	if (ret < 0)
+		return -1;
+
+	return 0;
+}
+
+static struct rte_driver pmd_packet_drv = {
+	.name = "eth_packet",
+	.type = PMD_VDEV,
+	.init = rte_pmd_packet_devinit,
+};
+
+PMD_REGISTER_DRIVER(pmd_packet_drv);
diff --git a/lib/librte_pmd_packet/rte_eth_packet.h b/lib/librte_pmd_packet/rte_eth_packet.h
new file mode 100644
index 000000000000..f685611da3e9
--- /dev/null
+++ b/lib/librte_pmd_packet/rte_eth_packet.h
@@ -0,0 +1,55 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ETH_PACKET_H_
+#define _RTE_ETH_PACKET_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define RTE_ETH_PACKET_PARAM_NAME "eth_packet"
+
+#define RTE_PMD_PACKET_MAX_RINGS 16
+
+/**
+ * For use by the EAL only. Called as part of EAL init to set up any
+ * AF_PACKET-backed virtual devices configured on the command line.
+ */
+int rte_pmd_packet_devinit(const char *name, const char *params);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 34dff2a02a05..a6994c4dbe93 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -210,6 +210,10 @@  ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
 LDLIBS += -lrte_pmd_pcap -lpcap
 endif
 
+ifeq ($(CONFIG_RTE_LIBRTE_PMD_PACKET),y)
+LDLIBS += -lrte_pmd_packet
+endif
+
 endif # plugins
 
 LDLIBS += $(EXECENV_LDLIBS)
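
Note on the ring parameters: the sizing arithmetic in rte_eth_from_packet()
and the mmap() length in rte_pmd_init_internals() can be checked in
isolation. What follows is a minimal sketch and is not part of the patch.
struct ring_geometry and ring_geometry_compute() are hypothetical names
introduced here for illustration only. The sketch assumes, as the patch does,
that the Rx and Tx rings share one tpacket_req and are mapped back to back.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not in the patch): derived sizes for one queue pair.
 */
struct ring_geometry {
	unsigned int block_count; /* value stored in tp_block_nr */
	size_t ring_bytes;        /* one ring: block size * block count */
	size_t map_bytes;         /* Rx ring followed by Tx ring */
};

/*
 * Mirrors the checks in rte_eth_from_packet():
 *   - a frame must fit inside a block
 *   - the frame count must fill at least one block
 * Returns 0 on success, -1 if the parameters would be rejected.
 */
int
ring_geometry_compute(unsigned int blocksize, unsigned int framesize,
                      unsigned int framecount, struct ring_geometry *g)
{
	unsigned int frames_per_block;

	assert(g != NULL);

	if (framesize == 0 || framesize > blocksize)
		return -1;

	frames_per_block = blocksize / framesize;
	g->block_count = framecount / frames_per_block;
	if (g->block_count == 0)
		return -1;

	g->ring_bytes = (size_t)blocksize * g->block_count;
	/* The patch maps both rings with a single mmap() of twice this size. */
	g->map_bytes = 2 * g->ring_bytes;
	return 0;
}
```

With the defaults from the commit message (blocksz 4096, framesz 2048,
framecnt 512) this gives 2 frames per block, 256 blocks, 1 MiB per ring and
a 2 MiB mapping per queue pair.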