[v4,03/14] vhost: add batch enqueue function for packed ring
Commit Message
The batch enqueue function first checks whether the descriptors are
cache aligned, and verifies all prerequisites up front. It does not
support chained mbufs; those are handled by the single-packet enqueue
function.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
Comments
On 10/9/19 3:38 PM, Marvin Liu wrote:
> The batch enqueue function first checks whether the descriptors are
> cache aligned, and verifies all prerequisites up front. It does not
> support chained mbufs; those are handled by the single-packet enqueue
> function.
>
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
On 10/9/19 3:38 PM, Marvin Liu wrote:
> The batch enqueue function first checks whether the descriptors are
> cache aligned, and verifies all prerequisites up front. It does not
> support chained mbufs; those are handled by the single-packet enqueue
> function.
>
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
Thinking again about this patch and series in general...
So this series improves performance by 40% in cases where:
- descriptors are cache aligned
- single mbuf
But my understanding is that it will cause performance regression for
the other cases, which may not be that uncommon, no?
Do you have some numbers about the performance impact on these other
cases?
Thanks,
Maxime
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, October 11, 2019 10:22 PM
> To: Liu, Yong <yong.liu@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Wang,
> Zhihong <zhihong.wang@intel.com>; stephen@networkplumber.org;
> gavin.hu@arm.com
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v4 03/14] vhost: add batch enqueue function for packed
> ring
>
>
>
> On 10/9/19 3:38 PM, Marvin Liu wrote:
> > The batch enqueue function first checks whether the descriptors are
> > cache aligned, and verifies all prerequisites up front. It does not
> > support chained mbufs; those are handled by the single-packet enqueue
> > function.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
>
> Thinking again about this patch and series in general...
>
> So this series improves performance by 40% in cases where:
> - descriptors are cache aligned
> - single mbuf
>
> But my understanding is that it will cause performance regression for
> the other cases, which may not be that uncommon, no?
>
> Do you have some number about the performance impact on these other
> cases?
>
Hi Maxime,
Checking the prerequisites for batch handling is simple and fast, so it
has almost no performance impact on the uncommon cases. Chained packets
can even benefit slightly from the cache-related optimizations.
As shown in the table below, all the cases I ran benefit from the vhost
optimization. In our experiments, the performance gain grows as more
packets are handled by the batch path.
+---------------------------------------------------+
|                                   | 19.08 | + opt |
|-----------------------------------|-------|-------|
| 1518B PvP                         | 2.63M | 2.98M |
|-----------------------------------|-------|-------|
| 64B loopback                      | 7.81M | 12.0M |
|-----------------------------------|-------|-------|
| 1518B loopback                    | 3.59M | 4.69M |
|-----------------------------------|-------|-------|
| 16K chained loopback              | 297K  | 306K  |
|-----------------------------------|-------|-------|
| 50% 256B + 50% 16K                | 296K  | 309K  |
|-----------------------------------|-------|-------|
| pktgen_sample03_burst_single_flow | 6.03M | 6.39M |
+---------------------------------------------------+
Regards,
Marvin
> Thanks,
> Maxime
@@ -39,6 +39,10 @@
#define VHOST_LOG_CACHE_NR 32
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+ sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
#ifdef SUPPORT_GCC_UNROLL_PRAGMA
#define UNROLL_PRAGMA_PARAM "GCC unroll 4"
#endif
@@ -883,6 +883,86 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
return pkt_idx;
}
+static __rte_unused int
+virtio_dev_rx_batch_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts)
+{
+ bool wrap_counter = vq->avail_wrap_counter;
+ struct vring_packed_desc *descs = vq->desc_packed;
+ uint16_t avail_idx = vq->last_avail_idx;
+ uint64_t desc_addrs[PACKED_BATCH_SIZE];
+ struct virtio_net_hdr_mrg_rxbuf *hdrs[PACKED_BATCH_SIZE];
+ uint32_t buf_offset = dev->vhost_hlen;
+ uint64_t lens[PACKED_BATCH_SIZE];
+ uint16_t i;
+
+ if (unlikely(avail_idx & PACKED_BATCH_MASK))
+ return -1;
+
+ if (unlikely((avail_idx + PACKED_BATCH_SIZE) > vq->size))
+ return -1;
+
+ UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+ for (i = 0; i < PACKED_BATCH_SIZE; i++) {
+ if (unlikely(pkts[i]->next != NULL))
+ return -1;
+ if (unlikely(!desc_is_avail(&descs[avail_idx + i],
+ wrap_counter)))
+ return -1;
+ }
+
+ rte_smp_rmb();
+
+ UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+ for (i = 0; i < PACKED_BATCH_SIZE; i++)
+ lens[i] = descs[avail_idx + i].len;
+
+ UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+ for (i = 0; i < PACKED_BATCH_SIZE; i++) {
+ if (unlikely(pkts[i]->pkt_len > (lens[i] - buf_offset)))
+ return -1;
+ }
+
+ UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+ for (i = 0; i < PACKED_BATCH_SIZE; i++)
+ desc_addrs[i] = vhost_iova_to_vva(dev, vq,
+ descs[avail_idx + i].addr,
+ &lens[i],
+ VHOST_ACCESS_RW);
+ UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+ for (i = 0; i < PACKED_BATCH_SIZE; i++) {
+ if (unlikely(lens[i] != descs[avail_idx + i].len))
+ return -1;
+ }
+
+ UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+ for (i = 0; i < PACKED_BATCH_SIZE; i++) {
+ rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+ hdrs[i] = (struct virtio_net_hdr_mrg_rxbuf *)
+ (uintptr_t)desc_addrs[i];
+ lens[i] = pkts[i]->pkt_len + dev->vhost_hlen;
+ }
+
+ UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+ for (i = 0; i < PACKED_BATCH_SIZE; i++)
+ virtio_enqueue_offload(pkts[i], &hdrs[i]->hdr);
+
+ vq->last_avail_idx += PACKED_BATCH_SIZE;
+ if (vq->last_avail_idx >= vq->size) {
+ vq->last_avail_idx -= vq->size;
+ vq->avail_wrap_counter ^= 1;
+ }
+
+ UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+ for (i = 0; i < PACKED_BATCH_SIZE; i++) {
+ rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
+ rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
+ pkts[i]->pkt_len);
+ }
+
+ return 0;
+}
+
static __rte_unused int16_t
virtio_dev_rx_single_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
struct rte_mbuf *pkt)