diff mbox series

[v1,2/2] linux/kni: Added support for KNI multiple fifos

Message ID 1607642153-24347-1-git-send-email-dheemanthm@vmware.com (mailing list archive)
State New
Delegated to: Thomas Monjalon

Checks

Context Check Description
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-testing success Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/checkpatch warning coding style issues

Commit Message

Dheemanth Mallikarjun Dec. 10, 2020, 11:15 p.m. UTC
In order to improve performance, KNI is made to
support multiple FIFOs, so that multiple threads pinned
to multiple cores can process packets in parallel.

Signed-off-by: dheemanth <dheemanthm@vmware.com>
---
 app/test/test_kni.c             |   4 +-
 drivers/net/kni/rte_eth_kni.c   |   5 +-
 examples/kni/main.c             |   4 +-
 kernel/linux/kni/kni_dev.h      |  11 +-
 kernel/linux/kni/kni_fifo.h     | 190 ++++++++++++++++++++++++++++++-----
 kernel/linux/kni/kni_misc.c     | 189 +++++++++++++++++++++--------------
 kernel/linux/kni/kni_net.c      |  88 ++++++++++------
 lib/librte_kni/rte_kni.c        | 216 ++++++++++++++++++++++++++--------------
 lib/librte_kni/rte_kni.h        |  11 +-
 lib/librte_kni/rte_kni_common.h |  10 +-
 lib/librte_port/rte_port_kni.c  |  12 +--
 11 files changed, 514 insertions(+), 226 deletions(-)

Comments

Thomas Monjalon Jan. 17, 2021, 10:02 p.m. UTC | #1
11/12/2020 00:15, dheemanth:
> In order to improve performance, the KNI is made to
> support multiple fifos, So that multiple threads pinned
> to multiple cores can process packets in parallel.
> 
> Signed-off-by: dheemanth <dheemanthm@vmware.com>

Is there a patch 1/2?

Cc Ferruh, KNI maintainer.
Dheemanth Mallikarjun Jan. 19, 2021, 5:41 p.m. UTC | #2
No, this is the only patch.

Regards,
Dheemanth

On 1/17/21, 2:02 PM, "Thomas Monjalon" <thomas@monjalon.net> wrote:

    11/12/2020 00:15, dheemanth:
    > In order to improve performance, the KNI is made to
    > support multiple fifos, So that multiple threads pinned
    > to multiple cores can process packets in parallel.
    > 
    > Signed-off-by: dheemanth <dheemanthm@vmware.com>

    Is there a patch 1/2?

    Cc Ferruh, KNI maintainer.
Ferruh Yigit June 28, 2021, 4:58 p.m. UTC | #3
On 12/10/2020 11:15 PM, dheemanth wrote:
> In order to improve performance, the KNI is made to
> support multiple fifos, So that multiple threads pinned
> to multiple cores can process packets in parallel.
> 
> Signed-off-by: dheemanth <dheemanthm@vmware.com>

Hi Dheemanth,

I haven't checked the patch yet, but as a very high-level comment:
it is possible to create multiple KNI interfaces and use multiple cores for them,
instead of multiple FIFOs in a single interface. The KNI example uses this
approach. Did you investigate it? What is the benefit of multiple FIFOs over
multiple KNI interfaces?

Thanks,
ferruh
Ferruh Yigit June 30, 2021, 12:23 p.m. UTC | #4
On 12/10/2020 11:15 PM, dheemanth wrote:
> In order to improve performance, the KNI is made to
> support multiple fifos, So that multiple threads pinned
> to multiple cores can process packets in parallel.
> 

Hi Dheemanth,

As far as I know, the bottleneck in KNI is the kernel thread. This patch
converts the FIFO between userspace and kernel space into multiple FIFOs, but on
the kernel side the same thread still processes all of them, so only userspace
can scale to more cores. I wonder how this improves performance; can you please
share your use case and some numbers?
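
To make the concern concrete, here is a minimal userspace sketch of one thread
polling every FIFO in turn, in the spirit of the loop the patch adds to
kni_net_rx(). It is not the module code: toy_fifo, toy_rx_one, toy_rx_all and
TOY_BURST are hypothetical names, and the burst size is an assumed value.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BURST 32u /* stand-in for MBUF_BURST_SZ; the value is assumed */

/* Hypothetical FIFO stand-in: only tracks how many packets are waiting. */
struct toy_fifo {
	unsigned int pending;
};

/* Drain at most TOY_BURST packets from one FIFO; returns packets handled. */
static unsigned int toy_rx_one(struct toy_fifo *f)
{
	unsigned int n = f->pending < TOY_BURST ? f->pending : TOY_BURST;

	f->pending -= n;
	return n;
}

/*
 * One polling pass of a single thread over every FIFO. The work per pass
 * is the sum over all FIFOs, so adding FIFOs adds no parallelism on this
 * side; it only gives the one thread more queues to visit.
 */
static unsigned int toy_rx_all(struct toy_fifo *fifos, size_t fifos_num)
{
	unsigned int total = 0;
	size_t i;

	assert(fifos != NULL || fifos_num == 0);
	for (i = 0; i < fifos_num; i++)
		total += toy_rx_one(&fifos[i]);
	return total;
}
```

In this model the thread's throughput ceiling is unchanged by the number of
FIFOs, which is why measured numbers would help.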

Also, the FIFOs seem to be converted from simple single-producer,
single-consumer to multi-producer, multi-consumer. What is the performance
impact of this, and why is it needed? In the DPDK application, is there an N-N
relation between cores and FIFOs? Again, can you please clarify your use case?
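
For reference, the single-producer, single-consumer scheme that the patch
replaces can be sketched in userspace with C11 atomics as below. This is only
an analogue of the removed kni_fifo_put()/kni_fifo_get() logic, assuming
acquire/release atomics in place of smp_load_acquire()/smp_store_release();
toy_spsc and TOY_LEN are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_LEN 8u /* ring size; must be a power of two */
_Static_assert((TOY_LEN & (TOY_LEN - 1)) == 0, "TOY_LEN must be a power of two");

struct toy_spsc {
	_Atomic uint32_t write; /* advanced only by the producer */
	_Atomic uint32_t read;  /* advanced only by the consumer */
	void *buffer[TOY_LEN];
};

/* Producer side: returns the number of elements actually written. */
static uint32_t toy_spsc_put(struct toy_spsc *f, void **data, uint32_t num)
{
	uint32_t i;
	uint32_t w = atomic_load_explicit(&f->write, memory_order_relaxed);
	uint32_t r = atomic_load_explicit(&f->read, memory_order_acquire);

	for (i = 0; i < num; i++) {
		uint32_t next = (w + 1) & (TOY_LEN - 1);

		if (next == r)
			break; /* full: one slot is always left empty */
		f->buffer[w] = data[i];
		w = next;
	}
	/* Publish the new entries with a single release store. */
	atomic_store_explicit(&f->write, w, memory_order_release);
	return i;
}

/* Consumer side: returns the number of elements actually read. */
static uint32_t toy_spsc_get(struct toy_spsc *f, void **data, uint32_t num)
{
	uint32_t i;
	uint32_t r = atomic_load_explicit(&f->read, memory_order_relaxed);
	uint32_t w = atomic_load_explicit(&f->write, memory_order_acquire);

	for (i = 0; i < num; i++) {
		if (r == w)
			break; /* empty */
		data[i] = f->buffer[r];
		r = (r + 1) & (TOY_LEN - 1);
	}
	atomic_store_explicit(&f->read, r, memory_order_release);
	return i;
}
```

Each index has exactly one writer, so no compare-and-swap loop or spinning is
needed; that is the cost baseline a multi-producer, multi-consumer variant
would have to be compared against.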

In the kernel-to-userspace transfer, packets are distributed to multiple FIFOs
based on the packet hash; this should be additional load on the kernel thread.
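
The selection step in question amounts to "hash the packet, take it modulo the
number of FIFOs". A minimal userspace sketch follows, assuming an FNV-1a hash
over a flow key as a stand-in for skb_get_hash(); toy_flow_hash and
toy_select_fifo are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over raw bytes; a hypothetical stand-in for skb_get_hash(). */
static uint32_t toy_flow_hash(const uint8_t *key, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= key[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Map a flow key to a FIFO index in [0, fifos_num). With a single FIFO the
 * hash is skipped entirely, as in the patch's kni_net_select_fifo().
 */
static unsigned int toy_select_fifo(const uint8_t *key, size_t len,
				    unsigned int fifos_num)
{
	assert(fifos_num > 0);
	if (fifos_num == 1)
		return 0;
	return toy_flow_hash(key, len) % fifos_num;
}
```

Packets of the same flow always land in the same FIFO, which preserves
per-flow ordering, but the hash and the modulo run on the transmit path for
every packet.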

The sample application and unit test (and the KNI PMD) do not use this new
feature; they only use a single FIFO. They should also be updated to use it,
which would serve as a sample and help demonstrate the use case.

Btw, can you please clarify why 'queues_num' is used? Is it expected to be the
same as 'fifos_num'?

Also, the documentation needs to be updated, but before making more changes I
think the benefit of the work needs to be clarified, to decide whether or not to
proceed with the set.

> Signed-off-by: dheemanth <dheemanthm@vmware.com>
> ---
>  app/test/test_kni.c             |   4 +-
>  drivers/net/kni/rte_eth_kni.c   |   5 +-
>  examples/kni/main.c             |   4 +-
>  kernel/linux/kni/kni_dev.h      |  11 +-
>  kernel/linux/kni/kni_fifo.h     | 190 ++++++++++++++++++++++++++++++-----
>  kernel/linux/kni/kni_misc.c     | 189 +++++++++++++++++++++--------------
>  kernel/linux/kni/kni_net.c      |  88 ++++++++++------
>  lib/librte_kni/rte_kni.c        | 216 ++++++++++++++++++++++++++--------------
>  lib/librte_kni/rte_kni.h        |  11 +-
>  lib/librte_kni/rte_kni_common.h |  10 +-
>  lib/librte_port/rte_port_kni.c  |  12 +--
>  11 files changed, 514 insertions(+), 226 deletions(-)
> 

<...>

> @@ -292,51 +292,69 @@ kni_ioctl_create(struct net *net, uint32_t ioctl_num,
>  {
>  	struct kni_net *knet = net_generic(net, kni_net_id);
>  	int ret;
> -	struct rte_kni_device_info dev_info;
> +	unsigned int i, tx_queues_num;
> +	struct rte_kni_device_info *dev_info;
>  	struct net_device *net_dev = NULL;
>  	struct kni_dev *kni, *dev, *n;
>  
>  	pr_info("Creating kni...\n");
> +
> +	/* allocate dev_info from stack to avoid Wframe-larger-than=1024
> +	 * compile error.
> +	 */

s/stack/heap

> +	dev_info = kzalloc(sizeof(struct rte_kni_device_info), GFP_KERNEL);
> +	if (!dev_info)
> +		return -ENOMEM;
> +

<...>

Patch

diff --git a/app/test/test_kni.c b/app/test/test_kni.c
index f53a53e..9bbceab 100644
--- a/app/test/test_kni.c
+++ b/app/test/test_kni.c
@@ -245,7 +245,7 @@  test_kni_loop(__rte_unused void *arg)
 			}
 
 			num = rte_kni_tx_burst(test_kni_ctx, pkts_burst,
-								nb_rx);
+						nb_rx, 0);
 			stats.ingress += num;
 			rte_kni_handle_request(test_kni_ctx);
 			if (num < nb_rx) {
@@ -260,7 +260,7 @@  test_kni_loop(__rte_unused void *arg)
 			if (test_kni_processing_flag)
 				break;
 			num = rte_kni_rx_burst(test_kni_ctx, pkts_burst,
-							PKT_BURST_SZ);
+						PKT_BURST_SZ, 0);
 			stats.egress += num;
 			for (nb_tx = 0; nb_tx < num; nb_tx++)
 				rte_pktmbuf_free(pkts_burst[nb_tx]);
diff --git a/drivers/net/kni/rte_eth_kni.c b/drivers/net/kni/rte_eth_kni.c
index 1696787..55711c5 100644
--- a/drivers/net/kni/rte_eth_kni.c
+++ b/drivers/net/kni/rte_eth_kni.c
@@ -81,7 +81,7 @@  eth_kni_rx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
 	uint16_t nb_pkts;
 	int i;
 
-	nb_pkts = rte_kni_rx_burst(kni, bufs, nb_bufs);
+	nb_pkts = rte_kni_rx_burst(kni, bufs, nb_bufs, 0);
 	for (i = 0; i < nb_pkts; i++)
 		bufs[i]->port = kni_q->internals->port_id;
 
@@ -97,7 +97,7 @@  eth_kni_tx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
 	struct rte_kni *kni = kni_q->internals->kni;
 	uint16_t nb_pkts;
 
-	nb_pkts =  rte_kni_tx_burst(kni, bufs, nb_bufs);
+	nb_pkts =  rte_kni_tx_burst(kni, bufs, nb_bufs, 0);
 
 	kni_q->tx.pkts += nb_pkts;
 
@@ -129,6 +129,7 @@  eth_kni_start(struct rte_eth_dev *dev)
 
 	mb_pool = internals->rx_queues[0].mb_pool;
 	strlcpy(conf.name, name, RTE_KNI_NAMESIZE);
+	memset(&conf, 0, sizeof(conf));
 	conf.force_bind = 0;
 	conf.group_id = port_id;
 	conf.mbuf_size =
diff --git a/examples/kni/main.c b/examples/kni/main.c
index fe93b86..a34bf1a 100644
--- a/examples/kni/main.c
+++ b/examples/kni/main.c
@@ -229,7 +229,7 @@  kni_ingress(struct kni_port_params *p)
 			return;
 		}
 		/* Burst tx to kni */
-		num = rte_kni_tx_burst(p->kni[i], pkts_burst, nb_rx);
+		num = rte_kni_tx_burst(p->kni[i], pkts_burst, nb_rx, 0);
 		if (num)
 			kni_stats[port_id].rx_packets += num;
 
@@ -261,7 +261,7 @@  kni_egress(struct kni_port_params *p)
 	port_id = p->port_id;
 	for (i = 0; i < nb_kni; i++) {
 		/* Burst rx from kni */
-		num = rte_kni_rx_burst(p->kni[i], pkts_burst, PKT_BURST_SZ);
+		num = rte_kni_rx_burst(p->kni[i], pkts_burst, PKT_BURST_SZ, 0);
 		if (unlikely(num > PKT_BURST_SZ)) {
 			RTE_LOG(ERR, APP, "Error receiving from KNI\n");
 			return;
diff --git a/kernel/linux/kni/kni_dev.h b/kernel/linux/kni/kni_dev.h
index c15da311..f782ec1 100644
--- a/kernel/linux/kni/kni_dev.h
+++ b/kernel/linux/kni/kni_dev.h
@@ -55,16 +55,16 @@  struct kni_dev {
 	struct net_device *net_dev;
 
 	/* queue for packets to be sent out */
-	struct rte_kni_fifo *tx_q;
+	struct rte_kni_fifo *tx_q[RTE_MAX_LCORE];
 
 	/* queue for the packets received */
-	struct rte_kni_fifo *rx_q;
+	struct rte_kni_fifo *rx_q[RTE_MAX_LCORE];
 
 	/* queue for the allocated mbufs those can be used to save sk buffs */
-	struct rte_kni_fifo *alloc_q;
+	struct rte_kni_fifo *alloc_q[RTE_MAX_LCORE];
 
 	/* free queue for the mbufs to be freed */
-	struct rte_kni_fifo *free_q;
+	struct rte_kni_fifo *free_q[RTE_MAX_LCORE];
 
 	/* request queue */
 	struct rte_kni_fifo *req_q;
@@ -87,6 +87,9 @@  struct kni_dev {
 	void *alloc_pa[MBUF_BURST_SZ];
 	void *alloc_va[MBUF_BURST_SZ];
 
+	unsigned int queues_num;
+	unsigned int fifos_num;
+
 	struct task_struct *usr_tsk;
 };
 
diff --git a/kernel/linux/kni/kni_fifo.h b/kernel/linux/kni/kni_fifo.h
index 5c91b55..6695129 100644
--- a/kernel/linux/kni/kni_fifo.h
+++ b/kernel/linux/kni/kni_fifo.h
@@ -18,48 +18,186 @@ 
 
 /**
  * Adds num elements into the fifo. Return the number actually written
+ * Multiple-producer safe based on  __rte_ring_mp_do_enqueue().
  */
-static inline uint32_t
-kni_fifo_put(struct rte_kni_fifo *fifo, void **data, uint32_t num)
+
+static inline unsigned
+kni_mp_fifo_put(struct rte_kni_fifo *fifo, void **data, unsigned int n)
 {
-	uint32_t i = 0;
-	uint32_t fifo_write = fifo->write;
-	uint32_t fifo_read = smp_load_acquire(&fifo->read);
-	uint32_t new_write = fifo_write;
+	unsigned int fifo_write, new_write;
+	unsigned int fifo_read, free_entries;
+	const unsigned int max = n;
+	int success = 0;
+	unsigned int i;
+	const unsigned int mask = (fifo->len) - 1;
+	unsigned int idx;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Move fifo->write.head atomically. */
+	do {
+		/* Reset n to the initial burst count. */
+		n = max;
+
+		fifo_write = fifo->write;
+		fifo_read = fifo->read;
+
+		/* The subtraction is done between two unsigned 32bits value
+		 * (the result is always modulo 32 bits even if we have
+		 * fifo_write > fifo_read). So 'free_entries' is always
+		 * between 0 and fifo->len-1.
+		 */
+		free_entries = mask + fifo_read - fifo_write;
 
-	for (i = 0; i < num; i++) {
-		new_write = (new_write + 1) & (fifo->len - 1);
+		/* Check that we have enough room in fifo. */
+		if (unlikely(n > free_entries)) {
+			if (unlikely(free_entries == 0))
+				return 0;
+			n = free_entries;
+		}
 
-		if (new_write == fifo_read)
-			break;
-		fifo->buffer[fifo_write] = data[i];
-		fifo_write = new_write;
+		new_write = fifo_write + n;
+		if (cmpxchg(&fifo->write, fifo_write, new_write) == fifo_write)
+			success = 1;
+
+	} while (unlikely(success == 0));
+ 
+	/* Write entries in fifo. */
+	idx = fifo_write & mask;
+	if (likely(idx + n < fifo->len)) {
+		for (i = 0; i < (n & (~0x3)); i += 4, idx += 4) {
+			fifo->buffer[idx] = data[i];
+			fifo->buffer[idx + 1] = data[i + 1];
+			fifo->buffer[idx + 2] = data[i + 2];
+			fifo->buffer[idx + 3] = data[i + 3];
+		}
+		switch (n & 0x3) {
+		case 3:
+			fifo->buffer[idx++] = data[i++];
+		case 2:
+			fifo->buffer[idx++] = data[i++];
+		case 1:
+			fifo->buffer[idx++] = data[i++];
+		}      
+	} else {
+		for (i = 0; i < n; i++)
+			fifo->buffer[(fifo_write + i) & mask] = data[i];
 	}
-	smp_store_release(&fifo->write, fifo_write);
+ 
+	/* barrier required to have ordered value for fifo write and read */
+	mb();
 
-	return i;
+	/* If there are other enqueues in progress that preceded us,
+	 * we need to wait for them to complete.
+	 */
+	while (unlikely(fifo->write != fifo_write))
+		cpu_relax();
+ 
+	fifo->write = new_write;
+	return n;
 }
 
 /**
- * Get up to num elements from the fifo. Return the number actully read
+ * Adds num elements into the fifo. Return the number actually written
  */
 static inline uint32_t
-kni_fifo_get(struct rte_kni_fifo *fifo, void **data, uint32_t num)
+kni_fifo_put(struct rte_kni_fifo *fifo, void **data, uint32_t num)
 {
-	uint32_t i = 0;
-	uint32_t new_read = fifo->read;
-	uint32_t fifo_write = smp_load_acquire(&fifo->write);
+	return kni_mp_fifo_put(fifo, data, num);
+}
 
-	for (i = 0; i < num; i++) {
-		if (new_read == fifo_write)
-			break;
+/**
+ * Get up to num elements from the fifo. Return the number actually read.
+ *
+ * Multiple-consumer safe based on __rte_ring_mc_do_dequeue().
+ */
+static inline uint32_t
+kni_mc_fifo_get(struct rte_kni_fifo *fifo, void **data, unsigned int n)
+{    
+	unsigned int fifo_read, fifo_write;
+	unsigned int new_read, entries;
+	const unsigned int max = n;
+	int success = 0;
+	unsigned int i;
+	unsigned int mask = (fifo->len) - 1;
+	unsigned int idx;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Move fifo->read.head atomically. */
+	do {
+		/* Restore n as it may change every loop. */
+		n = max;
+    
+		fifo_read = fifo->read;
+		fifo_write = fifo->write;
+              
+		/* The subtraction is done between two unsigned 32bits value
+		 * (the result is always modulo 32 bits even if we have
+		 * fifo_read > fifo_write). So 'entries' is always between 0
+		 * and fifo->len-1.
+		 */
+		entries = fifo_write - fifo_read;
+
+		/* Set the actual entries for dequeue. */
+		if (n > entries) {
+			if (unlikely(entries == 0))
+				return 0;
+			n = entries;
+		}
 
-		data[i] = fifo->buffer[new_read];
-		new_read = (new_read + 1) & (fifo->len - 1);
+		new_read = fifo_read + n;
+		if (cmpxchg(&fifo->read, fifo_read, new_read) == fifo_read)
+			success = 1;
+
+	} while (unlikely(success == 0));
+
+	/* Copy entries from fifo. */
+	idx = fifo_read & mask;
+	if (likely(idx + n < fifo->len)) {
+		for (i = 0; i < (n & (~0x3)); i += 4, idx += 4) {
+			data[i] = fifo->buffer[idx];
+			data[i + 1] = fifo->buffer[idx + 1];
+			data[i + 2] = fifo->buffer[idx + 2];
+			data[i + 3] = fifo->buffer[idx + 3];
+		}
+		switch (n & 0x3) {
+		case 3:
+			data[i++] = fifo->buffer[idx++];
+		case 2:
+			data[i++] = fifo->buffer[idx++];
+		case 1:
+			data[i++] = fifo->buffer[idx++];
+		}
+	} else {
+		for (i = 0; i < n; i++)
+			data[i] = fifo->buffer[(fifo_read + i) & mask];
 	}
-	smp_store_release(&fifo->read, new_read);
 
-	return i;
+	/* barrier required to have ordered value for fifo write and read */
+	mb();
+
+	/*
+	 * If there are other dequeues in progress that preceded us,
+	 * we need to wait for them to complete.
+	 */
+	while (unlikely(fifo->read != fifo_read))
+		cpu_relax();
+        
+	fifo->read = new_read;
+	return n;
+}
+
+
+/**
+ * Get up to num elements from the fifo. Return the number actually read
+ */
+static inline uint32_t
+kni_fifo_get(struct rte_kni_fifo *fifo, void **data, uint32_t num)
+{
+	return kni_mc_fifo_get(fifo, data, num);
 }
 
 /**
diff --git a/kernel/linux/kni/kni_misc.c b/kernel/linux/kni/kni_misc.c
index 2b464c4..a6abde7 100644
--- a/kernel/linux/kni/kni_misc.c
+++ b/kernel/linux/kni/kni_misc.c
@@ -292,51 +292,69 @@  kni_ioctl_create(struct net *net, uint32_t ioctl_num,
 {
 	struct kni_net *knet = net_generic(net, kni_net_id);
 	int ret;
-	struct rte_kni_device_info dev_info;
+	unsigned int i, tx_queues_num;
+	struct rte_kni_device_info *dev_info;
 	struct net_device *net_dev = NULL;
 	struct kni_dev *kni, *dev, *n;
 
 	pr_info("Creating kni...\n");
+
+	/* allocate dev_info from stack to avoid Wframe-larger-than=1024
+	 * compile error.
+	 */
+	dev_info = kzalloc(sizeof(struct rte_kni_device_info), GFP_KERNEL);
+	if (!dev_info)
+		return -ENOMEM;
+
 	/* Check the buffer size, to avoid warning */
-	if (_IOC_SIZE(ioctl_num) > sizeof(dev_info))
-		return -EINVAL;
+	if (_IOC_SIZE(ioctl_num) > sizeof(*dev_info)) {
+		ret = -EINVAL;
+		goto out;
+	}
 
 	/* Copy kni info from user space */
-	if (copy_from_user(&dev_info, (void *)ioctl_param, sizeof(dev_info)))
-		return -EFAULT;
+	if (copy_from_user(dev_info, (void *)ioctl_param, sizeof(*dev_info))) {
+		ret = -EFAULT;
+		goto out;
+	}
 
 	/* Check if name is zero-ended */
-	if (strnlen(dev_info.name, sizeof(dev_info.name)) == sizeof(dev_info.name)) {
+	if (strnlen(dev_info->name, sizeof(dev_info->name)) == sizeof(dev_info->name)) {
 		pr_err("kni.name not zero-terminated");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 	}
 
 	/**
 	 * Check if the cpu core id is valid for binding.
 	 */
-	if (dev_info.force_bind && !cpu_online(dev_info.core_id)) {
-		pr_err("cpu %u is not online\n", dev_info.core_id);
-		return -EINVAL;
+	if (dev_info->force_bind && !cpu_online(dev_info->core_id)) {
+		pr_err("cpu %u is not online\n", dev_info->core_id);
+		ret = -EINVAL;
+		goto out;
 	}
 
 	/* Check if it has been created */
 	down_read(&knet->kni_list_lock);
 	list_for_each_entry_safe(dev, n, &knet->kni_list_head, list) {
-		if (kni_check_param(dev, &dev_info) < 0) {
+		if (kni_check_param(dev, dev_info) < 0) {
 			up_read(&knet->kni_list_lock);
-			return -EINVAL;
+			ret = -EINVAL;
+			goto out;
 		}
 	}
+	tx_queues_num = dev_info->queues_num;
 	up_read(&knet->kni_list_lock);
 
-	net_dev = alloc_netdev(sizeof(struct kni_dev), dev_info.name,
+	net_dev = alloc_netdev_mqs(sizeof(struct kni_dev), dev_info->name,
 #ifdef NET_NAME_USER
-							NET_NAME_USER,
+					NET_NAME_USER,
 #endif
-							kni_net_init);
+					kni_net_init, tx_queues_num, 1);
 	if (net_dev == NULL) {
-		pr_err("error allocating device \"%s\"\n", dev_info.name);
-		return -EBUSY;
+		pr_err("error allocating device \"%s\"\n", dev_info->name);
+		ret = -EBUSY;
+		goto out;
 	}
 
 	dev_net_set(net_dev, net);
@@ -344,60 +362,68 @@  kni_ioctl_create(struct net *net, uint32_t ioctl_num,
 	kni = netdev_priv(net_dev);
 
 	kni->net_dev = net_dev;
-	kni->core_id = dev_info.core_id;
-	strncpy(kni->name, dev_info.name, RTE_KNI_NAMESIZE);
-
+	kni->core_id = dev_info->core_id;
+	strncpy(kni->name, dev_info->name, RTE_KNI_NAMESIZE);
+	kni->name[RTE_KNI_NAMESIZE - 1] = '\0';
+	kni->queues_num = tx_queues_num;
+	kni->fifos_num = dev_info->fifos_num;
 	/* Translate user space info into kernel space info */
-	if (dev_info.iova_mode) {
+	if (dev_info->iova_mode) {
 #ifdef HAVE_IOVA_TO_KVA_MAPPING_SUPPORT
-		kni->tx_q = iova_to_kva(current, dev_info.tx_phys);
-		kni->rx_q = iova_to_kva(current, dev_info.rx_phys);
-		kni->alloc_q = iova_to_kva(current, dev_info.alloc_phys);
-		kni->free_q = iova_to_kva(current, dev_info.free_phys);
-
-		kni->req_q = iova_to_kva(current, dev_info.req_phys);
-		kni->resp_q = iova_to_kva(current, dev_info.resp_phys);
-		kni->sync_va = dev_info.sync_va;
-		kni->sync_kva = iova_to_kva(current, dev_info.sync_phys);
+		for (i = 0; i < kni->fifos_num; i++) {
+			kni->tx_q[i] = iova_to_kva(current, dev_info->tx_phys[i]);
+			kni->rx_q[i] = iova_to_kva(current, dev_info->rx_phys[i]);
+			kni->alloc_q[i] = iova_to_kva(current, dev_info->alloc_phys[i]);
+			kni->free_q[i] = iova_to_kva(current, dev_info->free_phys[i]);
+		}
+
+		kni->req_q = iova_to_kva(current, dev_info->req_phys);
+		kni->resp_q = iova_to_kva(current, dev_info->resp_phys);
+		kni->sync_va = dev_info->sync_va;
+		kni->sync_kva = iova_to_kva(current, dev_info->sync_phys);
 		kni->usr_tsk = current;
 		kni->iova_mode = 1;
 #else
 		pr_err("KNI module does not support IOVA to VA translation\n");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 #endif
 	} else {
+		for (i = 0; i < kni->fifos_num; i++) {
+			kni->tx_q[i] = phys_to_virt(dev_info->tx_phys[i]);
+			kni->rx_q[i] = phys_to_virt(dev_info->rx_phys[i]);
+			kni->alloc_q[i] = phys_to_virt(dev_info->alloc_phys[i]);
+			kni->free_q[i] = phys_to_virt(dev_info->free_phys[i]);
+		}
 
-		kni->tx_q = phys_to_virt(dev_info.tx_phys);
-		kni->rx_q = phys_to_virt(dev_info.rx_phys);
-		kni->alloc_q = phys_to_virt(dev_info.alloc_phys);
-		kni->free_q = phys_to_virt(dev_info.free_phys);
-
-		kni->req_q = phys_to_virt(dev_info.req_phys);
-		kni->resp_q = phys_to_virt(dev_info.resp_phys);
-		kni->sync_va = dev_info.sync_va;
-		kni->sync_kva = phys_to_virt(dev_info.sync_phys);
+		kni->req_q = phys_to_virt(dev_info->req_phys);
+		kni->resp_q = phys_to_virt(dev_info->resp_phys);
+		kni->sync_va = dev_info->sync_va;
+		kni->sync_kva = phys_to_virt(dev_info->sync_phys);
 		kni->iova_mode = 0;
 	}
 
-	kni->mbuf_size = dev_info.mbuf_size;
-
-	pr_debug("tx_phys:      0x%016llx, tx_q addr:      0x%p\n",
-		(unsigned long long) dev_info.tx_phys, kni->tx_q);
-	pr_debug("rx_phys:      0x%016llx, rx_q addr:      0x%p\n",
-		(unsigned long long) dev_info.rx_phys, kni->rx_q);
-	pr_debug("alloc_phys:   0x%016llx, alloc_q addr:   0x%p\n",
-		(unsigned long long) dev_info.alloc_phys, kni->alloc_q);
-	pr_debug("free_phys:    0x%016llx, free_q addr:    0x%p\n",
-		(unsigned long long) dev_info.free_phys, kni->free_q);
+	kni->mbuf_size = dev_info->mbuf_size;
+
+	for (i = 0; i < kni->fifos_num; i++) {
+		pr_debug("tx_phys[%d]:      0x%016llx, tx_q[%d] addr:      0x%p\n",
+			 i, (unsigned long long) dev_info->tx_phys[i], i, kni->tx_q[i]);
+		pr_debug("rx_phys[%d]:      0x%016llx, rx_q[%d] addr:      0x%p\n",
+			 i, (unsigned long long) dev_info->rx_phys[i], i, kni->rx_q[i]);
+		pr_debug("alloc_phys[%d]:   0x%016llx, alloc_q[%d] addr:   0x%p\n",
+			 i, (unsigned long long) dev_info->alloc_phys[i], i, kni->alloc_q[i]);
+		pr_debug("free_phys[%d]:    0x%016llx, free_q[%d] addr:    0x%p\n",
+			 i, (unsigned long long) dev_info->free_phys[i], i, kni->free_q[i]);
+	}
 	pr_debug("req_phys:     0x%016llx, req_q addr:     0x%p\n",
-		(unsigned long long) dev_info.req_phys, kni->req_q);
+		(unsigned long long) dev_info->req_phys, kni->req_q);
 	pr_debug("resp_phys:    0x%016llx, resp_q addr:    0x%p\n",
-		(unsigned long long) dev_info.resp_phys, kni->resp_q);
+		(unsigned long long) dev_info->resp_phys, kni->resp_q);
 	pr_debug("mbuf_size:    %u\n", kni->mbuf_size);
 
 	/* if user has provided a valid mac address */
-	if (is_valid_ether_addr(dev_info.mac_addr))
-		memcpy(net_dev->dev_addr, dev_info.mac_addr, ETH_ALEN);
+	if (is_valid_ether_addr(dev_info->mac_addr))
+		memcpy(net_dev->dev_addr, dev_info->mac_addr, ETH_ALEN);
 	else
 		/*
 		 * Generate random mac address. eth_random_addr() is the
@@ -405,39 +431,43 @@  kni_ioctl_create(struct net *net, uint32_t ioctl_num,
 		 */
 		random_ether_addr(net_dev->dev_addr);
 
-	if (dev_info.mtu)
-		net_dev->mtu = dev_info.mtu;
+	if (dev_info->mtu)
+		net_dev->mtu = dev_info->mtu;
 #ifdef HAVE_MAX_MTU_PARAM
 	net_dev->max_mtu = net_dev->mtu;
 
-	if (dev_info.min_mtu)
-		net_dev->min_mtu = dev_info.min_mtu;
+	if (dev_info->min_mtu)
+		net_dev->min_mtu = dev_info->min_mtu;
 
-	if (dev_info.max_mtu)
-		net_dev->max_mtu = dev_info.max_mtu;
+	if (dev_info->max_mtu)
+		net_dev->max_mtu = dev_info->max_mtu;
 #endif
 
 	ret = register_netdev(net_dev);
 	if (ret) {
 		pr_err("error %i registering device \"%s\"\n",
-					ret, dev_info.name);
+					ret, dev_info->name);
 		kni->net_dev = NULL;
 		kni_dev_remove(kni);
 		free_netdev(net_dev);
-		return -ENODEV;
+		ret = -ENODEV;
 	}
 
 	netif_carrier_off(net_dev);
 
-	ret = kni_run_thread(knet, kni, dev_info.force_bind);
+	ret = kni_run_thread(knet, kni, dev_info->force_bind);
 	if (ret != 0)
-		return ret;
+		goto out;
 
 	down_write(&knet->kni_list_lock);
 	list_add(&kni->list, &knet->kni_list_head);
 	up_write(&knet->kni_list_lock);
 
-	return 0;
+	ret = 0;
+
+out:
+	kfree(dev_info);
+	return ret;
 }
 
 static int
@@ -447,21 +477,36 @@  kni_ioctl_release(struct net *net, uint32_t ioctl_num,
 	struct kni_net *knet = net_generic(net, kni_net_id);
 	int ret = -EINVAL;
 	struct kni_dev *dev, *n;
-	struct rte_kni_device_info dev_info;
+	struct rte_kni_device_info *dev_info;
+
+	/* allocate dev_info from heap to avoid Wframe-larger-than=1024
+	 * compile error.
+	 */
+
+	dev_info = kzalloc(sizeof(struct rte_kni_device_info), GFP_KERNEL);
+	if (!dev_info)
+		return -ENOMEM;
+
 
-	if (_IOC_SIZE(ioctl_num) > sizeof(dev_info))
+	if (_IOC_SIZE(ioctl_num) > sizeof(*dev_info)) {
+		kfree(dev_info);
 		return -EINVAL;
+	}
 
-	if (copy_from_user(&dev_info, (void *)ioctl_param, sizeof(dev_info)))
+	if (copy_from_user(dev_info, (void *)ioctl_param, sizeof(*dev_info))) {
+		kfree(dev_info);
 		return -EFAULT;
+	}
 
 	/* Release the network device according to its name */
-	if (strlen(dev_info.name) == 0)
+	if (strlen(dev_info->name) == 0) {
+		kfree(dev_info);
 		return -EINVAL;
+	}
 
 	down_write(&knet->kni_list_lock);
 	list_for_each_entry_safe(dev, n, &knet->kni_list_head, list) {
-		if (strncmp(dev->name, dev_info.name, RTE_KNI_NAMESIZE) != 0)
+		if (strncmp(dev->name, dev_info->name, RTE_KNI_NAMESIZE) != 0)
 			continue;
 
 		if (multiple_kthread_on && dev->pthread != NULL) {
@@ -476,8 +521,8 @@  kni_ioctl_release(struct net *net, uint32_t ioctl_num,
 	}
 	up_write(&knet->kni_list_lock);
 	pr_info("%s release kni named %s\n",
-		(ret == 0 ? "Successfully" : "Unsuccessfully"), dev_info.name);
-
+		(ret == 0 ? "Successfully" : "Unsuccessfully"), dev_info->name);
+	kfree(dev_info);
 	return ret;
 }
 
diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
index 4b75208..6dbd22c 100644
--- a/kernel/linux/kni/kni_net.c
+++ b/kernel/linux/kni/kni_net.c
@@ -29,9 +29,9 @@ 
 #define KNI_WAIT_RESPONSE_TIMEOUT 300 /* 3 seconds */
 
 /* typedef for rx function */
-typedef void (*kni_net_rx_t)(struct kni_dev *kni);
+typedef void (*kni_net_rx_t)(struct kni_dev *kni, int index);
 
-static void kni_net_rx_normal(struct kni_dev *kni);
+static void kni_net_rx_normal(struct kni_dev *kni, int index);
 
 /* kni rx function pointer, with default to normal rx */
 static kni_net_rx_t kni_net_rx_func = kni_net_rx_normal;
@@ -241,10 +241,17 @@  kni_fifo_trans_pa2va(struct kni_dev *kni,
 /* Try to release mbufs when kni release */
 void kni_net_release_fifo_phy(struct kni_dev *kni)
 {
-	/* release rx_q first, because it can't release in userspace */
-	kni_fifo_trans_pa2va(kni, kni->rx_q, kni->free_q);
-	/* release alloc_q for speeding up kni release in userspace */
-	kni_fifo_trans_pa2va(kni, kni->alloc_q, kni->free_q);
+	unsigned int i;
+
+	for (i = 0; i < kni->fifos_num; i++) {
+		/* release rx_q first, because it can't release in userspace */
+		kni_fifo_trans_pa2va(kni, kni->rx_q[i], kni->free_q[i]);
+	}
+
+	for (i = 0; i < kni->fifos_num; i++) {
+		/* release alloc_q for speeding up kni release in userspace */
+		kni_fifo_trans_pa2va(kni, kni->alloc_q[i], kni->free_q[i]);
+	}
 }
 
 /*
@@ -261,6 +268,24 @@  kni_net_config(struct net_device *dev, struct ifmap *map)
 }
 
 /*
+ * Select a tx fifo to enqueue the packets
+ */
+static unsigned
+kni_net_select_fifo(struct sk_buff *skb, struct kni_dev *kni)
+{
+	u32 hash;
+	unsigned int fifo_idx, fifos_num = kni->fifos_num;
+
+	if (unlikely(fifos_num == 1))
+		return 0;
+
+	hash = skb_get_hash(skb);
+	fifo_idx = hash % fifos_num;
+
+	return fifo_idx;
+}
+
+/*
  * Transmit a packet (called by the kernel)
  */
 static int
@@ -272,6 +297,7 @@  kni_net_tx(struct sk_buff *skb, struct net_device *dev)
 	struct rte_kni_mbuf *pkt_kva = NULL;
 	void *pkt_pa = NULL;
 	void *pkt_va = NULL;
+	unsigned int fifo_idx;
 
 	/* save the timestamp */
 #ifdef HAVE_TRANS_START_HELPER
@@ -284,12 +310,14 @@  kni_net_tx(struct sk_buff *skb, struct net_device *dev)
 	if (skb->len > kni->mbuf_size)
 		goto drop;
 
+	fifo_idx = kni_net_select_fifo(skb, kni);
+
 	/**
 	 * Check if it has at least one free entry in tx_q and
 	 * one entry in alloc_q.
 	 */
-	if (kni_fifo_free_count(kni->tx_q) == 0 ||
-			kni_fifo_count(kni->alloc_q) == 0) {
+	if (kni_fifo_free_count(kni->tx_q[fifo_idx]) == 0 ||
+			kni_fifo_count(kni->alloc_q[fifo_idx]) == 0) {
 		/**
 		 * If no free entry in tx_q or no entry in alloc_q,
 		 * drops skb and goes out.
@@ -298,7 +326,7 @@  kni_net_tx(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	/* dequeue a mbuf from alloc_q */
-	ret = kni_fifo_get(kni->alloc_q, &pkt_pa, 1);
+	ret = kni_fifo_get(kni->alloc_q[fifo_idx], &pkt_pa, 1);
 	if (likely(ret == 1)) {
 		void *data_kva;
 
@@ -316,7 +344,7 @@  kni_net_tx(struct sk_buff *skb, struct net_device *dev)
 		pkt_kva->data_len = len;
 
 		/* enqueue mbuf into tx_q */
-		ret = kni_fifo_put(kni->tx_q, &pkt_va, 1);
+		ret = kni_fifo_put(kni->tx_q[fifo_idx], &pkt_va, 1);
 		if (unlikely(ret != 1)) {
 			/* Failing should not happen */
 			pr_err("Fail to enqueue mbuf into tx_q\n");
@@ -347,7 +375,7 @@  kni_net_tx(struct sk_buff *skb, struct net_device *dev)
  * RX: normal working mode
  */
 static void
-kni_net_rx_normal(struct kni_dev *kni)
+kni_net_rx_normal(struct kni_dev *kni, int index)
 {
 	uint32_t ret;
 	uint32_t len;
@@ -358,7 +386,7 @@  kni_net_rx_normal(struct kni_dev *kni)
 	struct net_device *dev = kni->net_dev;
 
 	/* Get the number of free entries in free_q */
-	num_fq = kni_fifo_free_count(kni->free_q);
+	num_fq = kni_fifo_free_count(kni->free_q[index]);
 	if (num_fq == 0) {
 		/* No room on the free_q, bail out */
 		return;
@@ -368,7 +396,7 @@  kni_net_rx_normal(struct kni_dev *kni)
 	num_rx = min_t(uint32_t, num_fq, MBUF_BURST_SZ);
 
 	/* Burst dequeue from rx_q */
-	num_rx = kni_fifo_get(kni->rx_q, kni->pa, num_rx);
+	num_rx = kni_fifo_get(kni->rx_q[index], kni->pa, num_rx);
 	if (num_rx == 0)
 		return;
 
@@ -419,7 +447,7 @@  kni_net_rx_normal(struct kni_dev *kni)
 	}
 
 	/* Burst enqueue mbufs into free_q */
-	ret = kni_fifo_put(kni->free_q, kni->va, num_rx);
+	ret = kni_fifo_put(kni->free_q[index], kni->va, num_rx);
 	if (ret != num_rx)
 		/* Failing should not happen */
 		pr_err("Fail to enqueue entries into free_q\n");
@@ -429,7 +457,7 @@  kni_net_rx_normal(struct kni_dev *kni)
  * RX: loopback with enqueue/dequeue fifos.
  */
 static void
-kni_net_rx_lo_fifo(struct kni_dev *kni)
+kni_net_rx_lo_fifo(struct kni_dev *kni, int index)
 {
 	uint32_t ret;
 	uint32_t len;
@@ -441,16 +469,16 @@  kni_net_rx_lo_fifo(struct kni_dev *kni)
 	struct net_device *dev = kni->net_dev;
 
 	/* Get the number of entries in rx_q */
-	num_rq = kni_fifo_count(kni->rx_q);
+	num_rq = kni_fifo_count(kni->rx_q[index]);
 
 	/* Get the number of free entries in tx_q */
-	num_tq = kni_fifo_free_count(kni->tx_q);
+	num_tq = kni_fifo_free_count(kni->tx_q[index]);
 
 	/* Get the number of entries in alloc_q */
-	num_aq = kni_fifo_count(kni->alloc_q);
+	num_aq = kni_fifo_count(kni->alloc_q[index]);
 
 	/* Get the number of free entries in free_q */
-	num_fq = kni_fifo_free_count(kni->free_q);
+	num_fq = kni_fifo_free_count(kni->free_q[index]);
 
 	/* Calculate the number of entries to be dequeued from rx_q */
 	num = min(num_rq, num_tq);
@@ -463,12 +491,12 @@  kni_net_rx_lo_fifo(struct kni_dev *kni)
 		return;
 
 	/* Burst dequeue from rx_q */
-	ret = kni_fifo_get(kni->rx_q, kni->pa, num);
+	ret = kni_fifo_get(kni->rx_q[index], kni->pa, num);
 	if (ret == 0)
 		return; /* Failing should not happen */
 
 	/* Dequeue entries from alloc_q */
-	ret = kni_fifo_get(kni->alloc_q, kni->alloc_pa, num);
+	ret = kni_fifo_get(kni->alloc_q[index], kni->alloc_pa, num);
 	if (ret) {
 		num = ret;
 		/* Copy mbufs */
@@ -498,14 +526,14 @@  kni_net_rx_lo_fifo(struct kni_dev *kni)
 		}
 
 		/* Burst enqueue mbufs into tx_q */
-		ret = kni_fifo_put(kni->tx_q, kni->alloc_va, num);
+		ret = kni_fifo_put(kni->tx_q[index], kni->alloc_va, num);
 		if (ret != num)
 			/* Failing should not happen */
 			pr_err("Fail to enqueue mbufs into tx_q\n");
 	}
 
 	/* Burst enqueue mbufs into free_q */
-	ret = kni_fifo_put(kni->free_q, kni->va, num);
+	ret = kni_fifo_put(kni->free_q[index], kni->va, num);
 	if (ret != num)
 		/* Failing should not happen */
 		pr_err("Fail to enqueue mbufs into free_q\n");
@@ -522,7 +550,7 @@  kni_net_rx_lo_fifo(struct kni_dev *kni)
  * RX: loopback with enqueue/dequeue fifos and sk buffer copies.
  */
 static void
-kni_net_rx_lo_fifo_skb(struct kni_dev *kni)
+kni_net_rx_lo_fifo_skb(struct kni_dev *kni, int index)
 {
 	uint32_t ret;
 	uint32_t len;
@@ -533,10 +561,10 @@  kni_net_rx_lo_fifo_skb(struct kni_dev *kni)
 	struct net_device *dev = kni->net_dev;
 
 	/* Get the number of entries in rx_q */
-	num_rq = kni_fifo_count(kni->rx_q);
+	num_rq = kni_fifo_count(kni->rx_q[index]);
 
 	/* Get the number of free entries in free_q */
-	num_fq = kni_fifo_free_count(kni->free_q);
+	num_fq = kni_fifo_free_count(kni->free_q[index]);
 
 	/* Calculate the number of entries to dequeue from rx_q */
 	num = min(num_rq, num_fq);
@@ -547,7 +575,7 @@  kni_net_rx_lo_fifo_skb(struct kni_dev *kni)
 		return;
 
 	/* Burst dequeue mbufs from rx_q */
-	ret = kni_fifo_get(kni->rx_q, kni->pa, num);
+	ret = kni_fifo_get(kni->rx_q[index], kni->pa, num);
 	if (ret == 0)
 		return;
 
@@ -603,7 +631,7 @@  kni_net_rx_lo_fifo_skb(struct kni_dev *kni)
 	}
 
 	/* enqueue all the mbufs from rx_q into free_q */
-	ret = kni_fifo_put(kni->free_q, kni->va, num);
+	ret = kni_fifo_put(kni->free_q[index], kni->va, num);
 	if (ret != num)
 		/* Failing should not happen */
 		pr_err("Fail to enqueue mbufs into free_q\n");
@@ -613,11 +641,13 @@  kni_net_rx_lo_fifo_skb(struct kni_dev *kni)
 void
 kni_net_rx(struct kni_dev *kni)
 {
+	int i;
 	/**
 	 * It doesn't need to check if it is NULL pointer,
 	 * as it has a default value
 	 */
-	(*kni_net_rx_func)(kni);
+	for (i = 0; i < kni->fifos_num; i++)
+		(*kni_net_rx_func)(kni, i);
 }
 
 /*
diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
index 837d021..0b23a42 100644
--- a/lib/librte_kni/rte_kni.c
+++ b/lib/librte_kni/rte_kni.c
@@ -37,10 +37,10 @@ 
 #define KNI_MEM_CHECK(cond, fail) do { if (cond) goto fail; } while (0)
 
 #define KNI_MZ_NAME_FMT			"kni_info_%s"
-#define KNI_TX_Q_MZ_NAME_FMT		"kni_tx_%s"
-#define KNI_RX_Q_MZ_NAME_FMT		"kni_rx_%s"
-#define KNI_ALLOC_Q_MZ_NAME_FMT		"kni_alloc_%s"
-#define KNI_FREE_Q_MZ_NAME_FMT		"kni_free_%s"
+#define KNI_TX_Q_MZ_NAME_FMT		"kni_tx_%s.%u"
+#define KNI_RX_Q_MZ_NAME_FMT		"kni_rx_%s.%u"
+#define KNI_ALLOC_Q_MZ_NAME_FMT		"kni_alloc_%s.%u"
+#define KNI_FREE_Q_MZ_NAME_FMT		"kni_free_%s.%u"
 #define KNI_REQ_Q_MZ_NAME_FMT		"kni_req_%s"
 #define KNI_RESP_Q_MZ_NAME_FMT		"kni_resp_%s"
 #define KNI_SYNC_ADDR_MZ_NAME_FMT	"kni_sync_%s"
@@ -62,15 +62,15 @@  struct rte_kni {
 	struct rte_mempool *pktmbuf_pool;   /**< pkt mbuf mempool */
 	unsigned int mbuf_size;                 /**< mbuf size */
 
-	const struct rte_memzone *m_tx_q;   /**< TX queue memzone */
-	const struct rte_memzone *m_rx_q;   /**< RX queue memzone */
-	const struct rte_memzone *m_alloc_q;/**< Alloc queue memzone */
-	const struct rte_memzone *m_free_q; /**< Free queue memzone */
+	const struct rte_memzone *m_tx_q[RTE_MAX_LCORE];   /**< TX queue memzone */
+	const struct rte_memzone *m_rx_q[RTE_MAX_LCORE];   /**< RX queue memzone */
+	const struct rte_memzone *m_alloc_q[RTE_MAX_LCORE];/**< Alloc queue memzone */
+	const struct rte_memzone *m_free_q[RTE_MAX_LCORE]; /**< Free queue memzone */
 
-	struct rte_kni_fifo *tx_q;          /**< TX queue */
-	struct rte_kni_fifo *rx_q;          /**< RX queue */
-	struct rte_kni_fifo *alloc_q;       /**< Allocated mbufs queue */
-	struct rte_kni_fifo *free_q;        /**< To be freed mbufs queue */
+	struct rte_kni_fifo *tx_q[RTE_MAX_LCORE];          /**< TX queue */
+	struct rte_kni_fifo *rx_q[RTE_MAX_LCORE];          /**< RX queue */
+	struct rte_kni_fifo *alloc_q[RTE_MAX_LCORE];       /**< Allocated mbufs queue */
+	struct rte_kni_fifo *free_q[RTE_MAX_LCORE];        /**< To be freed mbufs queue */
 
 	const struct rte_memzone *m_req_q;  /**< Request queue memzone */
 	const struct rte_memzone *m_resp_q; /**< Response queue memzone */
@@ -82,6 +82,8 @@  struct rte_kni {
 	void *sync_addr;                   /**< Req/Resp Mem address */
 
 	struct rte_kni_ops ops;             /**< operations for request */
+	unsigned int queues_num;            /**< Number of TX queues of the KNI vNIC */
+	unsigned int fifos_num;             /**< Number of TX/RX/alloc/free fifo sets */
 };
 
 enum kni_ops_status {
@@ -89,8 +91,8 @@  enum kni_ops_status {
 	KNI_REQ_REGISTERED,
 };
 
-static void kni_free_mbufs(struct rte_kni *kni);
-static void kni_allocate_mbufs(struct rte_kni *kni);
+static void kni_free_mbufs(struct rte_kni *kni, unsigned int index);
+static void kni_allocate_mbufs(struct rte_kni *kni, unsigned int index);
 
 static volatile int kni_fd = -1;
 
@@ -140,29 +142,38 @@  __rte_kni_get(const char *name)
 }
 
 static int
-kni_reserve_mz(struct rte_kni *kni)
+kni_reserve_mz(struct rte_kni *kni, unsigned int fifos_num)
 {
+	unsigned int i, j;
 	char mz_name[RTE_MEMZONE_NAMESIZE];
+	for (i = 0; i < fifos_num; i++) {
+		snprintf(mz_name, RTE_MEMZONE_NAMESIZE, KNI_TX_Q_MZ_NAME_FMT, kni->name, i);
+		kni->m_tx_q[i] = rte_memzone_reserve(mz_name, KNI_FIFO_SIZE, SOCKET_ID_ANY,
+				RTE_MEMZONE_IOVA_CONTIG);
+		KNI_MEM_CHECK(kni->m_tx_q[i] == NULL, tx_q_fail);
+	}
 
-	snprintf(mz_name, RTE_MEMZONE_NAMESIZE, KNI_TX_Q_MZ_NAME_FMT, kni->name);
-	kni->m_tx_q = rte_memzone_reserve(mz_name, KNI_FIFO_SIZE, SOCKET_ID_ANY,
-			RTE_MEMZONE_IOVA_CONTIG);
-	KNI_MEM_CHECK(kni->m_tx_q == NULL, tx_q_fail);
+	for (i = 0; i < fifos_num; i++) {
+		snprintf(mz_name, RTE_MEMZONE_NAMESIZE, KNI_RX_Q_MZ_NAME_FMT, kni->name, i);
+		kni->m_rx_q[i] = rte_memzone_reserve(mz_name, KNI_FIFO_SIZE, SOCKET_ID_ANY,
+				RTE_MEMZONE_IOVA_CONTIG);
+		KNI_MEM_CHECK(kni->m_rx_q[i] == NULL, rx_q_fail);
+	}
 
-	snprintf(mz_name, RTE_MEMZONE_NAMESIZE, KNI_RX_Q_MZ_NAME_FMT, kni->name);
-	kni->m_rx_q = rte_memzone_reserve(mz_name, KNI_FIFO_SIZE, SOCKET_ID_ANY,
-			RTE_MEMZONE_IOVA_CONTIG);
-	KNI_MEM_CHECK(kni->m_rx_q == NULL, rx_q_fail);
 
-	snprintf(mz_name, RTE_MEMZONE_NAMESIZE, KNI_ALLOC_Q_MZ_NAME_FMT, kni->name);
-	kni->m_alloc_q = rte_memzone_reserve(mz_name, KNI_FIFO_SIZE, SOCKET_ID_ANY,
-			RTE_MEMZONE_IOVA_CONTIG);
-	KNI_MEM_CHECK(kni->m_alloc_q == NULL, alloc_q_fail);
+	for (i = 0; i < fifos_num; i++) {
+		snprintf(mz_name, RTE_MEMZONE_NAMESIZE, KNI_ALLOC_Q_MZ_NAME_FMT, kni->name, i);
+		kni->m_alloc_q[i] = rte_memzone_reserve(mz_name, KNI_FIFO_SIZE, SOCKET_ID_ANY,
+				RTE_MEMZONE_IOVA_CONTIG);
+		KNI_MEM_CHECK(kni->m_alloc_q[i] == NULL, alloc_q_fail);
+	}
 
-	snprintf(mz_name, RTE_MEMZONE_NAMESIZE, KNI_FREE_Q_MZ_NAME_FMT, kni->name);
-	kni->m_free_q = rte_memzone_reserve(mz_name, KNI_FIFO_SIZE, SOCKET_ID_ANY,
-			RTE_MEMZONE_IOVA_CONTIG);
-	KNI_MEM_CHECK(kni->m_free_q == NULL, free_q_fail);
+	for (i = 0; i < fifos_num; i++) {
+		snprintf(mz_name, RTE_MEMZONE_NAMESIZE, KNI_FREE_Q_MZ_NAME_FMT, kni->name, i);
+		kni->m_free_q[i] = rte_memzone_reserve(mz_name, KNI_FIFO_SIZE, SOCKET_ID_ANY,
+				RTE_MEMZONE_IOVA_CONTIG);
+		KNI_MEM_CHECK(kni->m_free_q[i] == NULL, free_q_fail);
+	}
 
 	snprintf(mz_name, RTE_MEMZONE_NAMESIZE, KNI_REQ_Q_MZ_NAME_FMT, kni->name);
 	kni->m_req_q = rte_memzone_reserve(mz_name, KNI_FIFO_SIZE, SOCKET_ID_ANY,
@@ -186,24 +197,51 @@  kni_reserve_mz(struct rte_kni *kni)
 resp_q_fail:
 	rte_memzone_free(kni->m_req_q);
 req_q_fail:
-	rte_memzone_free(kni->m_free_q);
+	for (j = 0; j < fifos_num; j++) {
+		rte_memzone_free(kni->m_alloc_q[j]);
+		rte_memzone_free(kni->m_rx_q[j]);
+		rte_memzone_free(kni->m_tx_q[j]);
+		rte_memzone_free(kni->m_free_q[j]);
+	}
+	return -1;
 free_q_fail:
-	rte_memzone_free(kni->m_alloc_q);
+	for (j = 0; j < fifos_num; j++) {
+		rte_memzone_free(kni->m_alloc_q[j]);
+		rte_memzone_free(kni->m_rx_q[j]);
+		rte_memzone_free(kni->m_tx_q[j]);
+	}
+	for (j = 0; j < i; j++)
+		rte_memzone_free(kni->m_free_q[j]);
+	return -1;
 alloc_q_fail:
-	rte_memzone_free(kni->m_rx_q);
+	for (j = 0; j < fifos_num; j++) {
+		rte_memzone_free(kni->m_rx_q[j]);
+		rte_memzone_free(kni->m_tx_q[j]);
+	}
+	for (j = 0; j < i; j++)
+		rte_memzone_free(kni->m_alloc_q[j]);
+	return -1;
 rx_q_fail:
-	rte_memzone_free(kni->m_tx_q);
+	for (j = 0; j < fifos_num; j++)
+		rte_memzone_free(kni->m_tx_q[j]);
+	for (j = 0; j < i; j++)
+		rte_memzone_free(kni->m_rx_q[j]);
+	return -1;
 tx_q_fail:
+	for (j = 0; j < i; j++)
+		rte_memzone_free(kni->m_tx_q[j]);
 	return -1;
 }
 
 static void
-kni_release_mz(struct rte_kni *kni)
+kni_release_mz(struct rte_kni *kni, unsigned int fifos_num)
 {
-	rte_memzone_free(kni->m_tx_q);
-	rte_memzone_free(kni->m_rx_q);
-	rte_memzone_free(kni->m_alloc_q);
-	rte_memzone_free(kni->m_free_q);
+	for (unsigned int i = 0; i < fifos_num; i++) {
+		rte_memzone_free(kni->m_tx_q[i]);
+		rte_memzone_free(kni->m_rx_q[i]);
+		rte_memzone_free(kni->m_alloc_q[i]);
+		rte_memzone_free(kni->m_free_q[i]);
+	}
 	rte_memzone_free(kni->m_req_q);
 	rte_memzone_free(kni->m_resp_q);
 	rte_memzone_free(kni->m_sync_addr);
@@ -215,6 +253,7 @@  rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
 	      struct rte_kni_ops *ops)
 {
 	int ret;
+	unsigned int i, fifos_num;
 	struct rte_kni_device_info dev_info;
 	struct rte_kni *kni;
 	struct rte_tailq_entry *te;
@@ -264,34 +303,47 @@  rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
 	dev_info.mtu = conf->mtu;
 	dev_info.min_mtu = conf->min_mtu;
 	dev_info.max_mtu = conf->max_mtu;
-
+	dev_info.queues_num = conf->queues_num ? conf->queues_num : 1;
+	dev_info.fifos_num = conf->fifos_num ? conf->fifos_num : 1;
 	memcpy(dev_info.mac_addr, conf->mac_addr, RTE_ETHER_ADDR_LEN);
 
 	strlcpy(dev_info.name, conf->name, RTE_KNI_NAMESIZE);
 
-	ret = kni_reserve_mz(kni);
+	ret = kni_reserve_mz(kni, dev_info.fifos_num);
 	if (ret < 0)
 		goto mz_fail;
 
+	fifos_num = dev_info.fifos_num;
+	kni->fifos_num = fifos_num;
+	kni->queues_num = dev_info.queues_num;
+
 	/* TX RING */
-	kni->tx_q = kni->m_tx_q->addr;
-	kni_fifo_init(kni->tx_q, KNI_FIFO_COUNT_MAX);
-	dev_info.tx_phys = kni->m_tx_q->iova;
+	for (i = 0; i < fifos_num; i++) {
+		kni->tx_q[i] = kni->m_tx_q[i]->addr;
+		kni_fifo_init(kni->tx_q[i], KNI_FIFO_COUNT_MAX);
+		dev_info.tx_phys[i] = kni->m_tx_q[i]->iova;
+	}
 
 	/* RX RING */
-	kni->rx_q = kni->m_rx_q->addr;
-	kni_fifo_init(kni->rx_q, KNI_FIFO_COUNT_MAX);
-	dev_info.rx_phys = kni->m_rx_q->iova;
+	for (i = 0; i < fifos_num; i++) {
+		kni->rx_q[i] = kni->m_rx_q[i]->addr;
+		kni_fifo_init(kni->rx_q[i], KNI_FIFO_COUNT_MAX);
+		dev_info.rx_phys[i] = kni->m_rx_q[i]->iova;
+	}
 
 	/* ALLOC RING */
-	kni->alloc_q = kni->m_alloc_q->addr;
-	kni_fifo_init(kni->alloc_q, KNI_FIFO_COUNT_MAX);
-	dev_info.alloc_phys = kni->m_alloc_q->iova;
+	for (i = 0; i < fifos_num; i++) {
+		kni->alloc_q[i] = kni->m_alloc_q[i]->addr;
+		kni_fifo_init(kni->alloc_q[i], KNI_FIFO_COUNT_MAX);
+		dev_info.alloc_phys[i] = kni->m_alloc_q[i]->iova;
+	}
 
 	/* FREE RING */
-	kni->free_q = kni->m_free_q->addr;
-	kni_fifo_init(kni->free_q, KNI_FIFO_COUNT_MAX);
-	dev_info.free_phys = kni->m_free_q->iova;
+	for (i = 0; i < fifos_num; i++) {
+		kni->free_q[i] = kni->m_free_q[i]->addr;
+		kni_fifo_init(kni->free_q[i], KNI_FIFO_COUNT_MAX);
+		dev_info.free_phys[i] = kni->m_free_q[i]->iova;
+	}
 
 	/* Request RING */
 	kni->req_q = kni->m_req_q->addr;
@@ -326,12 +378,13 @@  rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
 	rte_mcfg_tailq_write_unlock();
 
 	/* Allocate mbufs and then put them into alloc_q */
-	kni_allocate_mbufs(kni);
+	for (i = 0; i < fifos_num; i++)
+		kni_allocate_mbufs(kni, i);
 
 	return kni;
 
 ioctl_fail:
-	kni_release_mz(kni);
+	kni_release_mz(kni, fifos_num);
 mz_fail:
 	rte_free(kni);
 kni_fail:
@@ -407,7 +460,7 @@  rte_kni_release(struct rte_kni *kni)
 	struct rte_kni_list *kni_list;
 	struct rte_kni_device_info dev_info;
 	uint32_t retry = 5;
-
+	unsigned int i;
 	if (!kni)
 		return -1;
 
@@ -436,17 +489,24 @@  rte_kni_release(struct rte_kni *kni)
 	/* mbufs in all fifo should be released, except request/response */
 
 	/* wait until all rxq packets processed by kernel */
-	while (kni_fifo_count(kni->rx_q) && retry--)
-		usleep(1000);
+	for (i = 0; i < kni->fifos_num; i++) {
+		while (kni_fifo_count(kni->rx_q[i]) && retry--)
+			usleep(1000);
+		retry = 5;
+	}
 
-	if (kni_fifo_count(kni->rx_q))
-		RTE_LOG(ERR, KNI, "Fail to free all Rx-q items\n");
+	for (i = 0; i < kni->fifos_num; i++) {
+		if (kni_fifo_count(kni->rx_q[i]))
+			RTE_LOG(ERR, KNI, "Fail to free all Rx-q items for queue: %u\n", i);
+	}
 
-	kni_free_fifo_phy(kni->pktmbuf_pool, kni->alloc_q);
-	kni_free_fifo(kni->tx_q);
-	kni_free_fifo(kni->free_q);
+	for (i = 0; i < kni->fifos_num; i++) {
+		kni_free_fifo_phy(kni->pktmbuf_pool, kni->alloc_q[i]);
+		kni_free_fifo(kni->tx_q[i]);
+		kni_free_fifo(kni->free_q[i]);
+	}
 
-	kni_release_mz(kni);
+	kni_release_mz(kni, kni->fifos_num);
 
 	rte_free(kni);
 
@@ -602,9 +662,10 @@  rte_kni_handle_request(struct rte_kni *kni)
 }
 
 unsigned
-rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned int num)
+rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs,
+		unsigned int num, unsigned int index)
 {
-	num = RTE_MIN(kni_fifo_free_count(kni->rx_q), num);
+	num = RTE_MIN(kni_fifo_free_count(kni->rx_q[index]), num);
 	void *phy_mbufs[num];
 	unsigned int ret;
 	unsigned int i;
@@ -612,33 +673,34 @@  rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned int num)
 	for (i = 0; i < num; i++)
 		phy_mbufs[i] = va2pa_all(mbufs[i]);
 
-	ret = kni_fifo_put(kni->rx_q, phy_mbufs, num);
+	ret = kni_fifo_put(kni->rx_q[index], phy_mbufs, num);
 
 	/* Get mbufs from free_q and then free them */
-	kni_free_mbufs(kni);
+	kni_free_mbufs(kni, index);
 
 	return ret;
 }
 
 unsigned
-rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned int num)
+rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs,
+		unsigned int num, unsigned int index)
 {
-	unsigned int ret = kni_fifo_get(kni->tx_q, (void **)mbufs, num);
+	unsigned int ret = kni_fifo_get(kni->tx_q[index], (void **)mbufs, num);
 
 	/* If buffers removed, allocate mbufs and then put them into alloc_q */
 	if (ret)
-		kni_allocate_mbufs(kni);
+		kni_allocate_mbufs(kni, index);
 
 	return ret;
 }
 
 static void
-kni_free_mbufs(struct rte_kni *kni)
+kni_free_mbufs(struct rte_kni *kni, unsigned int index)
 {
 	int i, ret;
 	struct rte_mbuf *pkts[MAX_MBUF_BURST_NUM];
 
-	ret = kni_fifo_get(kni->free_q, (void **)pkts, MAX_MBUF_BURST_NUM);
+	ret = kni_fifo_get(kni->free_q[index], (void **)pkts, MAX_MBUF_BURST_NUM);
 	if (likely(ret > 0)) {
 		for (i = 0; i < ret; i++)
 			rte_pktmbuf_free(pkts[i]);
@@ -646,7 +708,7 @@  kni_free_mbufs(struct rte_kni *kni)
 }
 
 static void
-kni_allocate_mbufs(struct rte_kni *kni)
+kni_allocate_mbufs(struct rte_kni *kni, unsigned int index)
 {
 	int i, ret;
 	struct rte_mbuf *pkts[MAX_MBUF_BURST_NUM];
@@ -674,7 +736,7 @@  kni_allocate_mbufs(struct rte_kni *kni)
 		return;
 	}
 
-	allocq_free = (kni->alloc_q->read - kni->alloc_q->write - 1)
+	allocq_free = (kni->alloc_q[index]->read - kni->alloc_q[index]->write - 1)
 			& (MAX_MBUF_BURST_NUM - 1);
 	for (i = 0; i < allocq_free; i++) {
 		pkts[i] = rte_pktmbuf_alloc(kni->pktmbuf_pool);
@@ -690,7 +752,7 @@  kni_allocate_mbufs(struct rte_kni *kni)
 	if (i <= 0)
 		return;
 
-	ret = kni_fifo_put(kni->alloc_q, phys, i);
+	ret = kni_fifo_put(kni->alloc_q[index], phys, i);
 
 	/* Check if any mbufs not put into alloc_q, and then free them */
 	if (ret >= 0 && ret < i && ret < MAX_MBUF_BURST_NUM) {
diff --git a/lib/librte_kni/rte_kni.h b/lib/librte_kni/rte_kni.h
index b0eaf46..31fe42a 100644
--- a/lib/librte_kni/rte_kni.h
+++ b/lib/librte_kni/rte_kni.h
@@ -75,6 +75,9 @@  struct rte_kni_conf {
 	uint16_t mtu;
 	uint16_t min_mtu;
 	uint16_t max_mtu;
+
+	unsigned int fifos_num;  /**< Number of fifo sets; 0 means 1 */
+	unsigned int queues_num; /**< Number of queues; 0 means 1 */
 };
 
 /**
@@ -162,12 +165,14 @@  int rte_kni_handle_request(struct rte_kni *kni);
  *  The array to store the pointers of mbufs.
  * @param num
  *  The maximum number per burst.
+ * @param index
+ *  Index of the fifo to receive from, in the range [0, fifos_num).
  *
  * @return
  *  The actual number of packets retrieved.
  */
 unsigned rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs,
-		unsigned num);
+		unsigned int num, unsigned int index);
 
 /**
  * Send a burst of packets to a KNI interface. The packets to be sent out are
@@ -181,12 +186,14 @@  unsigned rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs,
  *  The array to store the pointers of mbufs.
  * @param num
  *  The maximum number per burst.
+ * @param index
+ *  Index of the fifo to send on, in the range [0, fifos_num).
  *
  * @return
  *  The actual number of packets sent.
  */
 unsigned rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs,
-		unsigned num);
+		unsigned int num, unsigned int index);
 
 /**
  * Get the KNI context of its name.
diff --git a/lib/librte_kni/rte_kni_common.h b/lib/librte_kni/rte_kni_common.h
index ffb3182..35afebf 100644
--- a/lib/librte_kni/rte_kni_common.h
+++ b/lib/librte_kni/rte_kni_common.h
@@ -99,10 +99,10 @@  struct rte_kni_mbuf {
 struct rte_kni_device_info {
 	char name[RTE_KNI_NAMESIZE];  /**< Network device name for KNI */
 
-	phys_addr_t tx_phys;
-	phys_addr_t rx_phys;
-	phys_addr_t alloc_phys;
-	phys_addr_t free_phys;
+	phys_addr_t tx_phys[RTE_MAX_LCORE];
+	phys_addr_t rx_phys[RTE_MAX_LCORE];
+	phys_addr_t alloc_phys[RTE_MAX_LCORE];
+	phys_addr_t free_phys[RTE_MAX_LCORE];
 
 	/* Used by Ethtool */
 	phys_addr_t req_phys;
@@ -127,6 +127,8 @@  struct rte_kni_device_info {
 	unsigned int max_mtu;
 	uint8_t mac_addr[6];
 	uint8_t iova_mode;
+	unsigned int fifos_num;
+	unsigned int queues_num;
 };
 
 #define KNI_DEVICE "kni"
diff --git a/lib/librte_port/rte_port_kni.c b/lib/librte_port/rte_port_kni.c
index 7b370f9..648b832 100644
--- a/lib/librte_port/rte_port_kni.c
+++ b/lib/librte_port/rte_port_kni.c
@@ -67,7 +67,7 @@  rte_port_kni_reader_rx(void *port, struct rte_mbuf **pkts, uint32_t n_pkts)
 			port;
 	uint16_t rx_pkt_cnt;
 
-	rx_pkt_cnt = rte_kni_rx_burst(p->kni, pkts, n_pkts);
+	rx_pkt_cnt = rte_kni_rx_burst(p->kni, pkts, n_pkts, 0);
 	RTE_PORT_KNI_READER_STATS_PKTS_IN_ADD(p, rx_pkt_cnt);
 	return rx_pkt_cnt;
 }
@@ -165,7 +165,7 @@  send_burst(struct rte_port_kni_writer *p)
 {
 	uint32_t nb_tx;
 
-	nb_tx = rte_kni_tx_burst(p->kni, p->tx_buf, p->tx_buf_count);
+	nb_tx = rte_kni_tx_burst(p->kni, p->tx_buf, p->tx_buf_count, 0);
 
 	RTE_PORT_KNI_WRITER_STATS_PKTS_DROP_ADD(p, p->tx_buf_count - nb_tx);
 	for (; nb_tx < p->tx_buf_count; nb_tx++)
@@ -208,7 +208,7 @@  rte_port_kni_writer_tx_bulk(void *port,
 			send_burst(p);
 
 		RTE_PORT_KNI_WRITER_STATS_PKTS_IN_ADD(p, n_pkts);
-		n_pkts_ok = rte_kni_tx_burst(p->kni, pkts, n_pkts);
+		n_pkts_ok = rte_kni_tx_burst(p->kni, pkts, n_pkts, 0);
 
 		RTE_PORT_KNI_WRITER_STATS_PKTS_DROP_ADD(p, n_pkts - n_pkts_ok);
 		for (; n_pkts_ok < n_pkts; n_pkts_ok++) {
@@ -349,7 +349,7 @@  send_burst_nodrop(struct rte_port_kni_writer_nodrop *p)
 {
 	uint32_t nb_tx = 0, i;
 
-	nb_tx = rte_kni_tx_burst(p->kni, p->tx_buf, p->tx_buf_count);
+	nb_tx = rte_kni_tx_burst(p->kni, p->tx_buf, p->tx_buf_count, 0);
 
 	/* We sent all the packets in a first try */
 	if (nb_tx >= p->tx_buf_count) {
@@ -360,7 +360,7 @@  send_burst_nodrop(struct rte_port_kni_writer_nodrop *p)
 	for (i = 0; i < p->n_retries; i++) {
 		nb_tx += rte_kni_tx_burst(p->kni,
 			p->tx_buf + nb_tx,
-			p->tx_buf_count - nb_tx);
+			p->tx_buf_count - nb_tx, 0);
 
 		/* We sent all the packets in more than one try */
 		if (nb_tx >= p->tx_buf_count) {
@@ -412,7 +412,7 @@  rte_port_kni_writer_nodrop_tx_bulk(void *port,
 			send_burst_nodrop(p);
 
 		RTE_PORT_KNI_WRITER_NODROP_STATS_PKTS_IN_ADD(p, n_pkts);
-		n_pkts_ok = rte_kni_tx_burst(p->kni, pkts, n_pkts);
+		n_pkts_ok = rte_kni_tx_burst(p->kni, pkts, n_pkts, 0);
 
 		if (n_pkts_ok >= n_pkts)
 			return 0;