[v3,2/2] doc: add guide for debug and troubleshoot

Message ID 20181126070815.37501-2-vipin.varghese@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series [v3,1/2] doc: add svg for debug and troubleshoot guide |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Varghese, Vipin Nov. 26, 2018, 7:08 a.m. UTC
  Add user guide on debug and troubleshoot for common issues and bottleneck
found in various application models running on single or multi stages.

Signed-off-by: Vipin Varghese <vipin.varghese@intel.com>
Acked-by: Marko Kovacevic <marko.kovacevic@intel.com>
---

V3:
 - reorder for removing warning in 'make doc-guides-html' - Thomas Monjalon

V2:
 - add offload flag check - Vipin Varghese
 - change tab to space - Marko Kovacevic
 - spelling correction - Marko Kovacevic
 - remove extra characters - Marko Kovacevic
 - add ACK by Marko - Vipn Varghese
---
 doc/guides/howto/debug_troubleshoot_guide.rst | 351 ++++++++++++++++++
 doc/guides/howto/index.rst                    |   1 +
 2 files changed, 352 insertions(+)
 create mode 100644 doc/guides/howto/debug_troubleshoot_guide.rst
  

Comments

Shreyansh Jain Jan. 4, 2019, 6:37 a.m. UTC | #1
Hello Vipin,

Some comments and lots of nitpicks inlined.
(I know this comes months late - apologies, just didn't stumble on this 
earlier).

On Monday 26 November 2018 12:38 PM, Vipin Varghese wrote:
> Add user guide on debug and troubleshoot for common issues and bottleneck
> found in various application models running on single or multi stages.
> 
> Signed-off-by: Vipin Varghese <vipin.varghese@intel.com>
> Acked-by: Marko Kovacevic <marko.kovacevic@intel.com>
> ---
> 
> V3:
>   - reorder for removing warning in 'make doc-guides-html' - Thomas Monjalon
> 
> V2:
>   - add offload flag check - Vipin Varghese
>   - change tab to space - Marko Kovacevic
>   - spelling correction - Marko Kovacevic
>   - remove extra characters - Marko Kovacevic
>   - add ACK by Marko - Vipn Varghese
> ---
>   doc/guides/howto/debug_troubleshoot_guide.rst | 351 ++++++++++++++++++
>   doc/guides/howto/index.rst                    |   1 +
>   2 files changed, 352 insertions(+)
>   create mode 100644 doc/guides/howto/debug_troubleshoot_guide.rst
> 
> diff --git a/doc/guides/howto/debug_troubleshoot_guide.rst b/doc/guides/howto/debug_troubleshoot_guide.rst
> new file mode 100644
> index 000000000..55589085e
> --- /dev/null
> +++ b/doc/guides/howto/debug_troubleshoot_guide.rst
> @@ -0,0 +1,351 @@
> +..  SPDX-License-Identifier: BSD-3-Clause
> +    Copyright(c) 2018 Intel Corporation.
> +
> +.. _debug_troubleshoot_via_pmd:
> +
> +Debug & Troubleshoot guide via PMD
> +==================================
> +
> +DPDK applications can be designed to run as single thread simple stage to
> +multiple threads with complex pipeline stages. These application can use poll
> +mode devices which helps in offloading CPU cycles. A few models are
> +
> +  *  single primary
> +  *  multiple primary
> +  *  single primary single secondary
> +  *  single primary multiple secondary
> +
> +In all the above cases, it is a tedious task to isolate, debug and understand
> +odd behaviour which can occurring random or periodic. The goal of guide is to
                  ^^^^^^^^^^^^^^^^^^^^^^                           ^^^^^
    either of: which can occur randomly or periodically ...        this
               which occurs randomly or periodically

> +share and explore a few commonly seen patterns and behaviour. Then isolate and
                                                                 ^^^^^^^
                                    super nitpick:          Then, isolate

> +identify the root cause via step by step debug at various processing stages.
> +
> +Application Overview
> +--------------------
> +
> +Let us take up an example application as reference for explaining issues and
> +patterns commonly seen. The sample application in discussion makes use of
> +single primary model with various pipeline stages. The application uses PMD
> +and libraries such as service cores, mempool, pkt mbuf, event, crypto, QoS
> +and eth.
> +
> +The overview of an application modeled using PMD is shown in
> +:numref:`dtg_sample_app_model`.
> +
> +.. _dtg_sample_app_model:
> +
> +.. figure:: img/dtg_sample_app_model.*
> +
> +   Overview of pipeline stage of an application
> +
> +Bottleneck Analysis
> +-------------------
> +
> +To debug the bottleneck and performance issues the desired application
> +is made to run in an environment matching as below
> +
> +#. Linux 64-bit|32-bit
> +#. DPDK PMD and libraries are used
> +#. Libraries and PMD are either static or shared. But not both
> +#. Machine flag optimizations of gcc or compiler are made constant
> +
> +Is there mismatch in packet rate (received < send)?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +RX Port and associated core :numref:`dtg_rx_rate`.
> +
> +.. _dtg_rx_rate:
> +
> +.. figure:: img/dtg_rx_rate.*
> +
> +   RX send rate compared against Received rate
> +
> +#. are generic configuration correct?
> +    -  What is port Speed, Duplex? rte_eth_link_get()
> +    -  Is packet of higher sizes are dropped? rte_eth_get_mtu()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           Are packets of higher size dropped?

> +    -  Are only specific MAC are received? rte_eth_promiscuous_get()
                                ^^^^^^^
                           can be removed

What about checking vlan-filters, if any? - that is, if packets have 
some vlan stamped and hardware is configured to filter that.
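
For illustration, a minimal sketch of such a check (the helper name is 
mine, purely illustrative):

    #include <stdio.h>
    #include <rte_ethdev.h>

    /* Warn if VLAN filtering is active on the port; frames whose VLAN ID
     * was never added via rte_eth_dev_vlan_filter() are dropped by the
     * hardware and never show up in the Rx burst. */
    static void check_vlan_filtering(uint16_t port_id)
    {
        int mask = rte_eth_dev_get_vlan_offload(port_id);

        if (mask < 0)
            return;        /* query not supported by this PMD */
        if (mask & ETH_VLAN_FILTER_OFFLOAD)
            printf("port %u: VLAN filter enabled, check configured VIDs\n",
                   port_id);
    }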

> +
> +#. are there NIC specific drops?
> +    -  Check rte_eth_rx_queue_info_get() for nb_desc, scattered_rx,
> +    -  Check rte_eth_dev_stats() for Stats per queue
> +    -  Is stats of other queues shows no change via
> +       rte_eth_dev_dev_rss_hash_conf_get()

           Do stats for other queues show no change via ...
Or, maybe you intend to say - check stats for all queues in case RSS is 
configured and packets are being distributed to a queue not being 
listened to.
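
Something like the sketch below would make that visible (assuming queue 
stats are mapped for the first RTE_ETHDEV_QUEUE_STAT_CNTRS queues):

    #include <inttypes.h>
    #include <stdio.h>
    #include <rte_ethdev.h>

    /* Dump per-queue Rx counters so traffic hashed to an unserviced
     * queue shows up. */
    static void dump_rxq_stats(uint16_t port_id)
    {
        struct rte_eth_stats st;
        uint16_t q;

        if (rte_eth_stats_get(port_id, &st) != 0)
            return;
        for (q = 0; q < RTE_ETHDEV_QUEUE_STAT_CNTRS; q++)
            printf("rxq %u: packets %" PRIu64 " errors %" PRIu64 "\n",
                   q, st.q_ipackets[q], st.q_errors[q]);
    }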

> +    -  Check if port offload and queue offload matches.
> +
> +#. If problem still persists, this might be at RX lcore thread
> +    -  Check if RX thread, distributor or event rx adapter is holding or
> +       processing more than required
                     ^^^^^^^^^^^
    Did you mean to say "...processing less than required.."? Because if 
Rx doesn't match Tx (from a generator), it would be less processing.

> +    -  try using rte_prefetch_non_temporal() to intimate the mbuf in pulled
> +       to cache for temporary
         ^^^^^^^
Maybe this statement is incomplete - or, need rephrasing.
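
If the intent is "prefetch the just-received mbufs before the heavier 
processing touches them", a sketch could be:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    static void rx_and_prefetch(uint16_t port_id, uint16_t queue_id)
    {
        struct rte_mbuf *pkts[32];
        uint16_t nb, i;

        nb = rte_eth_rx_burst(port_id, queue_id, pkts, 32);
        for (i = 0; i < nb; i++)
            rte_prefetch_non_temporal(rte_pktmbuf_mtod(pkts[i], void *));
        /* ... then run the heavier per-packet processing over pkts[0..nb) */
    }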

> +
> +
> +Are there packet drops (receive|transmit)?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +RX-TX Port and associated cores :numref:`dtg_rx_tx_drop`.
> +
> +.. _dtg_rx_tx_drop:
> +
> +.. figure:: img/dtg_rx_tx_drop.*
> +
> +   RX-TX drops
> +
> +#. at RX
> +    -  Get the rx queues by rte_eth_dev_info_get() for nb_rx_queues
           ^^^^
Or, you meant to write "Get the Rx queue info using 
rte_eth_dev_info_get()". Or maybe "Get Rx queue count using info from 
rte_eth_dev_info_get()"
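
FWIW, a sketch that pulls both the queue count and the per-queue info 
(headers as in the earlier sketch; assumes struct rte_eth_dev_info 
exposes nb_rx_queues, as recent releases do):

    static void dump_rxq_info(uint16_t port_id)
    {
        struct rte_eth_dev_info info;
        struct rte_eth_rxq_info qinfo;
        uint16_t q;

        rte_eth_dev_info_get(port_id, &info);
        for (q = 0; q < info.nb_rx_queues; q++)
            if (rte_eth_rx_queue_info_get(port_id, q, &qinfo) == 0)
                printf("rxq %u: nb_desc %u scattered_rx %u\n",
                       q, qinfo.nb_desc, qinfo.scattered_rx);
    }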

> +    -  Check for miss, errors, qerros by rte_eth_dev_stats() for imissed,
> +       ierrors, q_erros, rx_nombuf, rte_mbuf_ref_count
> +
> +#. at TX
> +    -  Are we doing in bulk to reduce the TX descriptor overhead?
            ^^^^^^
Assuming all your conversation until now is second person, 'you'.
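
Unrelated to the wording - the bulk pattern the line refers to is roughly 
this (sketch; pkts/nb come from the preceding pipeline stage):

    /* Hand the whole burst to the driver at once and free whatever it
     * did not accept, instead of one rte_eth_tx_burst() per packet. */
    uint16_t sent = rte_eth_tx_burst(port_id, queue_id, pkts, nb);

    while (sent < nb)
        rte_pktmbuf_free(pkts[sent++]);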

> +    -  Check rte_eth_dev_stats() for oerrors, qerros, rte_mbuf_ref_count
> +    -  Is the packet multi segmented? Check if port and queue offlaod is set.
> +
> +Are there object drops in producer point for ring?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Producer point for ring :numref:`dtg_producer_ring`.
> +
> +.. _dtg_producer_ring:
> +
> +.. figure:: img/dtg_producer_ring.*
> +
> +   Producer point for Rings
> +
> +#. Performance for Producer
> +    -  Fetch the type of RING 'rte_ring_dump()' for flags (RING_F_SP_ENQ)
> +    -  If '(burst enqueue - actual enqueue) > 0' check rte_ring_count() or
> +       rte_ring_free_count()
> +    -  If 'burst or single enqueue is 0', then there is no more space check
> +       rte_ring_full() or not
         ^^^^^^
This line needs rephrase; maybe like: If 'burst or single enqueue is 0', 
then, if there is no more space, check rte_ring_full().
Or
... then, check if the ring is full or not using rte_ring_full()
(Or, your similar statement for rte_ring_empty(), below, looks better)
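
i.e. something like (sketch; 'r' and 'objs' as in the producer code):

    unsigned int burst = 32, n;

    n = rte_ring_enqueue_burst(r, objs, burst, NULL);
    if (n < burst)
        printf("enqueued %u of %u: ring full? %d, free slots %u\n",
               n, burst, rte_ring_full(r), rte_ring_free_count(r));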

> +
> +Are there object drops in consumer point for ring?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Consumer point for ring :numref:`dtg_consumer_ring`.
> +
> +.. _dtg_consumer_ring:
> +
> +.. figure:: img/dtg_consumer_ring.*
> +
> +   Consumer point for Rings
> +
> +#. Performance for Consumer
> +    -  Fetch the type of RING – rte_ring_dump() for flags (RING_F_SC_DEQ)
> +    -  If '(burst dequeue - actual dequeue) > 0' for rte_ring_free_count()
> +    -  If 'burst or single enqueue' always results 0 check the ring is empty
> +       via rte_ring_empty()
> +
> +Is packets or objects are not processed at desired rate?
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    Are packets or objects not processed at desired rate?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Memory objects close to NUMA :numref:`dtg_mempool`.
> +
> +.. _dtg_mempool:
> +
> +.. figure:: img/dtg_mempool.*
> +
> +   Memory objects has to be close to device per NUMA
> +
> +#. Is the performance low?
> +    -  Are packets received from multiple NIC? rte_eth_dev_count_all()
> +    -  Are NIC interfaces on different socket? use rte_eth_dev_socket_id()
> +    -  Is mempool created with right socket? rte_mempool_create() or
> +       rte_pktmbuf_pool_create()
> +    -  Are we seeing drop on specific socket? It might require more
             ^^^^^
Just like above, as your perspective has been second person, 'you'.
> +       mempool objects; try allocating more object

Or, check that mempool depletion levels are not being reached, using 
"rte_mempool_get_count()" or "rte_mempool_avail_count()". That might 
help if mempool objects are falling short.
And this might actually lead to drops, rather than a drop in the desired 
rate (which is a consequence).

(I do notice that you have mentioned this below - but I think that is 
the first thing to check. That's just my opinion - so, feel free to ignore)
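
A tiny sketch of that early check, e.g. run periodically from a debug 
callback ('mp' being the pool in question):

    printf("mempool %s: available %u, in use %u\n", mp->name,
           rte_mempool_avail_count(mp), rte_mempool_in_use_count(mp));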


> +    -  Is there single RX thread for multiple NIC? try having multiple
> +       lcore to read from fixed interface or we might be hitting cache
> +       limit, so Increase cache_size for pool_create()
                    ^^^
                Small 'i'
> +
> +#. Are we are still seeing low performance
> +    -  Check if sufficient objects in mempool by rte_mempool_avail_count()
> +    -  Is failure in some pkt? we might be getting pkts with size > mbuf
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1) Use packets rather than pkts or pkt
2) Did you mean "Is there failure in Rx of packets?" Or, "Packet drops 
being observed?"
> +       data size. Check rte_pktmbuf_is_continguous()
                             ^^^^^^^^^^^^^^^^^^^^^^^^^
I think you intended: rte_pktmbuf_is_contiguous()
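
For example (sketch, 'm' being a received mbuf):

    /* Flag frames that arrived split across segments, e.g. when the
     * frame is larger than the mbuf data room and scatter Rx is on. */
    if (!rte_pktmbuf_is_contiguous(m))
        printf("pkt_len %u spread over %u segments\n",
               m->pkt_len, (unsigned int)m->nb_segs);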

> +    -  If user pthread is used to access object access
> +       rte_mempool_cache_create()

Maybe I didn't get the point - but why does using rte_mempool_cache_create 
help in the case of external pthreads? Using a cache would be applicable in 
all cases (assuming mempool_create(..cache_size=0..)).

> +    -  Try using 1GB huge pages instead of 2MB. If there is difference,
                                                    ^^^^^^^^
              If there is a difference or if there is no difference?
> +       try then rte_mem_lock_page() for 2MB pages
          ^^^^^
Should be: 'then try ...'

> +
> +.. note::
> +  Stall in release of MBUF can be because
> +
> +  *  Processing pipeline is too heavy
> +  *  Number of stages are too many
> +  *  TX is not transferred at desired rate
> +  *  Multi segment is not offloaded at TX device.
> +  *  Application misuse scenarios can be
> +      -  not freeing packets
> +      -  invalid rte_pktmbuf_refcnt_set
> +      -  invalid rte_pktmbuf_prefree_seg
> +
> +Is there difference in performance for crypto?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Crypto device and PMD :numref:`dtg_crypto`.
> +
> +.. _dtg_crypto:
> +
> +.. figure:: img/dtg_crypto.*
> +
> +   CRYPTO and interaction with PMD device
> +
> +#. are generic configuration correct?
> +    -  Get total crypto devices – rte_cryptodev_count()
> +    -  Cross check SW or HW flags are configured properly
> +       rte_cryptodev_info_get() for feature_flags
> +
> +#. If enqueue request > actual enqueue (drops)?
> +    -  Is the queue pair setup for proper node
> +       rte_cryptodev_queue_pair_setup() for socket_id
> +    -  Is the session_pool created from same socket_id as queue pair?
> +    -  Is enqueue thread same socket_id?
                     ^^^^^^^^^^^^^
                  ... thread on same ...

> +    -  rte_cryptodev_stats() for drops err_count for enqueue or dequeue
> +    -  Are there multiple threads enqueue or dequeue from same queue pair?
                     ^^^^^^^^^^^^^^^^^^^^^^^
I think you meant:
Are there multiple threads enqueuing/dequeuing from same...
Or
Do multiple threads enqueue or dequeue from same ....

> +
> +#. If enqueue rate > dequeue rate?
> +    -  Is dequeue lcore thread is same socket_id?
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Do you mean the dequeue lcore thread has the same socket as the enqueue lcore 
thread? If so, can you elaborate?
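
If the point is NUMA placement, a sketch of the check, run from the 
dequeue thread (dev_id is illustrative):

    int dev_socket = rte_cryptodev_socket_id(dev_id);
    unsigned int my_socket = rte_lcore_to_socket_id(rte_lcore_id());

    if (dev_socket >= 0 && (unsigned int)dev_socket != my_socket)
        printf("cryptodev is on socket %d, dequeue lcore on socket %u\n",
               dev_socket, my_socket);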

> +    -  If SW crypto is in use, check if the CRYPTO Library build with
                                                           ^^^^^^^^^
                                              ...is built with...
> +       right (SIMD) flags Or check if the queue pair using CPU ISA by
                             ^^^^
                             or
> +       rte_cryptodev_info_get() for feature_flags for AVX|SSE
> +    -  If we are using HW crypto – Is the card on same NUMA socket as
                                        ^^^^^^^^^^^^^
By 'card', you mean the hardware block? So, some form factors of SoC 
might not even have 'cards'. Maybe, "Is hardware crypto block affined to 
correct NUMA socket as per application configuration" can be a better check.

> +       queue pair and session pool?
> +
> +worker functions not giving performance?
^^^^^
'W' in place of 'w'

> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Custom worker function :numref:`dtg_distributor_worker`.
> +
> +.. _dtg_distributor_worker:
> +
> +.. figure:: img/dtg_distributor_worker.*
> +
> +   Custom worker function performance drops
> +
> +#. Performance
> +    -  Threads context switches more frequently? Identify lcore with
                                         ^^^^^^^^^^^
                                         frequent

> +       rte_lcore() and lcore index mapping with rte_lcore_index(). Best
> +       performance when mapping of thread and core is 1:1.
> +    -  Check lcore role type and state? rte_eal_lcore_role for
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is a suggestion, not a question, so the '?' should be removed.

> +       rte, off and service. User function on service core might be
           ^^^^^^^^^^^^^^^^^^^^^
I think you mean: Use rte_eal_lcore_role for identifying role of core as 
either of RTE, SERVICE or OFF. Otherwise, it looks more like notes than 
a suggestion/hint.

> +       sharing timeslots with other functions.
> +    -  Check the cpu core? check rte_thread_get_affinity() and
           ^^^^^^^^^^^^^^^^^^^
This is not a question - a suggestion, so '?' should be removed.

> +       rte_eal_get_lcore_state() for run state.
> +
> +#. Debug
> +    -  Mode of operation? rte_eal_get_configuration() for master, fetch
         ^^^^^^^^^^^^^^^^^^^^^
"Check mode of operation" or "What is the mode of operation?"

> +       lcore|service|numa count, process_type.
> +    -  Check lcore run mode? rte_eal_lcore_role() for rte, off, service.
          ^^^^^^^^^^^^^^^^^^^^^^
Same as above - it looks more like notes than a troubleshooting step.

> +    -  process details? rte_dump_stack(), rte_dump_registers() and
          ^^^^^^^^^^^^^^^^^
Same as above.

> +       rte_memdump() will give insights.
> +
> +service functions are not frequent enough?
^^^^^^
'S'

> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +service functions on service cores :numref:`dtg_service`.
> +
> +.. _dtg_service:
> +
> +.. figure:: img/dtg_service.*
> +
> +   functions running on service cores
> +
> +#. Performance
> +    -  Get service core count? rte_service_lcore_count() and compare with
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
Again, this is not a question, but a suggestion and should be without 
'?'. And the statement is incoherent - "Get the service core count using 
rte_service_lcore_count() API and compare with result of ..." may be 
better.

> +       result of rte_eal_get_configuration()
> +    -  Check if registered service is available?
           ^^^^^^^
Suggestion, not a question.

> +       rte_service_get_by_name(), rte_service_get_count() and
> +       rte_service_get_name()
> +    -  Is given service running parallel on multiple lcores?
> +       rte_service_probe_capability() and rte_service_map_lcore_get()
> +    -  Is service running? rte_service_runstate_get()
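
A compact sketch of the checks above ("my_service" is a hypothetical 
registered service name):

    uint32_t id;

    printf("service lcores: %d\n", rte_service_lcore_count());
    if (rte_service_get_by_name("my_service", &id) == 0)
        printf("%s: runstate %d\n",
               rte_service_get_name(id), rte_service_runstate_get(id));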
> +
> +#. Debug
> +    -  Find how many services are running on specific service lcore by
> +       rte_service_lcore_count_services()
> +    -  Generic debug via rte_service_dump()
> +
> +Is there bottleneck in eventdev?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +#. are generic configuration correct?
> +    -  Get event_dev devices? rte_event_dev_count()
> +    -  Are they created on correct socket_id? - rte_event_dev_socket_id()
> +    -  Check if HW or SW capabilities? - rte_event_dev_info_get() for
> +       event_qos, queue_all_types, burst_mode, multiple_queue_port,
> +       max_event_queue|dequeue_depth
> +    -  Is packet stuck in queue? check for stages (event qeueue) where
> +       packets are looped back to same or previous stages.
> +
> +#. Performance drops in enqueue (event count > actual enqueue)?
> +    -  Dump the event_dev information? rte_event_dev_dump()
> +    -  Check stats for queue and port for eventdev
> +    -  Check the inflight, current queue element for enqueue|deqeue
> +
> +How to debug QoS via TM?
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +TM on TX interface :numref:`dtg_qos_tx`.
> +
> +.. _dtg_qos_tx:
> +
> +.. figure:: img/dtg_qos_tx.*
> +
> +   Traffic Manager just before TX
> +
> +#. Is configuration right?
> +    -  Get current capabilities for DPDK port rte_tm_capabilities_get()
> +       for max nodes, level, shaper_private, shaper_shared, sched_n_children
> +       and stats_mask
> +    -  Check if current leaf are configured identically rte_tm_capabilities_get()
> +       for lead_nodes_identicial
                        ^^^^^^^^^^^^^^
                        identical

> +    -  Get leaf nodes for a dpdk port - rte_tn_get_number_of_leaf_node()
> +    -  Check level capabilities by rte_tm_level_capabilities_get for n_nodes
> +        -  Max, nonleaf_max, leaf_max
> +        -  identical, non_identical
> +        -  Shaper_private_supported
> +        -  Stats_mask
> +        -  Cman wred packet|byte supported
> +        -  Cman head drop supported
> +    -  Check node capabilities by rte_tm_node_capabilities_get for n_nodes
> +        -  Shaper_private_supported
> +        -  Stats_mask
> +        -  Cman wred packet|byte supported
> +        -  Cman head drop supported
> +    -  Debug via stats - rte_tm_stats_update() and rte_tm_node_stats_read()
> +
> +Packet is not of right format?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Packet capture before and after processing :numref:`dtg_pdump`.
> +
> +.. _dtg_pdump:
> +
> +.. figure:: img/dtg_pdump.*
> +
> +   Capture points of Traffic at RX-TX
> +
> +#.  with primary enabling then secondary can access. Copies packets from
> +    specific RX or TX queues to secondary process ring buffers.

The statement above would be clearer with a rephrase. For example 'with 
primary enabling then secondary can access' is incomplete (enable what 
and access what?)

> +
> +.. note::
> +  Need to explore:

Is this section for user to explore? Or, maybe it is a residual from 
your notes?

> +    *  if secondary shares same interface can we enable from secondary
> +       for rx|tx happening on primary
> +    *  Specific PMD private data dump the details
> +    *  User private data if present, dump the details
> +
> +How to develop custom code to debug?
> +------------------------------------
> +
> +-  For single process - the debug functionality is to be added in same
> +   process
> +-  For multiple process - the debug functionality can be added to
> +   secondary multi process
> +
> +.. note::
> +
> +  Primary's Debug functions invoked via
> +    #. Timer call-back
> +    #. Service function under service core
> +    #. USR1 or USR2 signal handler
> +
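
As a side note, the third invocation path can be as small as the sketch 
below (port 0 and the handler name are only for illustration):

    #include <signal.h>
    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_ethdev.h>

    static void dump_on_usr1(int sig)
    {
        struct rte_eth_stats st;

        (void)sig;
        if (rte_eth_stats_get(0, &st) == 0)
            printf("ipackets %" PRIu64 " imissed %" PRIu64
                   " rx_nombuf %" PRIu64 "\n",
                   st.ipackets, st.imissed, st.rx_nombuf);
    }

    /* in main(), after rte_eal_init(): */
    signal(SIGUSR1, dump_on_usr1);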
> diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
> index a642a2be1..9527fa84d 100644
> --- a/doc/guides/howto/index.rst
> +++ b/doc/guides/howto/index.rst
> @@ -18,3 +18,4 @@ HowTo Guides
>       virtio_user_as_exceptional_path
>       packet_capture_framework
>       telemetry
> +    debug_troubleshoot_guide
> 


Frankly, it is a good attempt, but I think this is more in shape of 
notes right now.

I do understand that it is never going to be an easy task to list 
troubleshooting steps - that is fairly dependent on env and use-case. 
But if you can rephrase and add some context to some sections, it 
would give really nice inputs/hints as a starting point.

Again - just my take. And being so delayed, feel free to ignore.

-
Shreyansh
  
Varghese, Vipin Jan. 4, 2019, 7:01 a.m. UTC | #2
Hi Shreyansh Jain,

Thanks for the comments; please give me time to look into the other comments.

snipped
> > +    -  Are only specific MAC are received? rte_eth_promiscuous_get()
>                                 ^^^^^^^
>                            can be removed
> 
> What about checking vlan-filters, if any? - that is, if packets have some vlan
> stamped and hardware is configured to filter that.
This is an excellent thought; can we work together to add this part once the baseline is added to the guide?

> 
> > +
> > +#. are there NIC specific drops?
> > +    -  Check rte_eth_rx_queue_info_get() for nb_desc, scattered_rx,
> > +    -  Check rte_eth_dev_stats() for Stats per queue
> > +    -  Is stats of other queues shows no change via
> > +       rte_eth_dev_dev_rss_hash_conf_get()
> 
>            Do stats for other queues show no change via ...
> Or, maybe you intend to say - check stats for all queues in case RSS is
> configured and packets are being distributed to a queue not being listened to.
Yes.

snipped
> > +#. Are we are still seeing low performance
> > +    -  Check if sufficient objects in mempool by rte_mempool_avail_count()
> > +    -  Is failure in some pkt? we might be getting pkts with size > mbuf
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 1) Use packets rather than pkts or pkt
> 2) Did you mean "Is there failure in Rx of packets?" Or, "Packet drops
> being observed?"
> > +       data size. Check rte_pktmbuf_is_continguous()
>                              ^^^^^^^^^^^^^^^^^^^^^^^^^
> I think you intended: rte_pktmbuf_is_contiguous()
Well, during field deployments with customers, failing to allow multi-segment has led to drops in virtio and NIC PMDs. I will try to rephrase this in a better way.

> 
> > +    -  If user pthread is used to access object access
> > +       rte_mempool_cache_create()
> 
> Maybe I didn't get the point - but why does using rte_mempool_cache_create
> help in the case of external pthreads? Using a cache would be applicable in
> all cases (assuming mempool_create(..cache_size=0..)).
The API 'rte_mempool_cache_create' is meant for non-EAL threads. In use-case scenarios where pthreads are created either before or after 'rte_eal_init', it would be useful to use this to isolate whether the defect is caused by starving the objects in the master mempool.
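
A rough sketch of that usage from a non-EAL pthread ('mp' is the pool 
shared with the EAL threads; cache and burst sizes picked arbitrarily):

    struct rte_mempool_cache *c;
    void *objs[32];

    c = rte_mempool_cache_create(256, rte_socket_id());
    if (c != NULL && rte_mempool_generic_get(mp, objs, 32, c) == 0) {
        /* ... work on the objects ... */
        rte_mempool_generic_put(mp, objs, 32, c);
    }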

snipped
> > +
> > +.. note::
> > +  Need to explore:
> 
> Is this section for user to explore? Or, maybe it is a residual from
> your notes?
As shared below, these would be other possible areas for the user to explore, in continuation of the above.

snipped
> 
> 
> Frankly, it is a good attempt, but I think this is more in shape of
> notes right now.
> 
> I do understand that it is never going to be an easy task to list
> troubleshooting steps - that is fairly dependent on env and use-case.
> But if you can rephrase and add some context to some sections, it
> would give really nice inputs/hints as a starting point.
This is an attempt to capture and share points for debugging and troubleshooting. As rightly pointed out, this will vary for different use cases and scenarios.

In my humble opinion, an attempt is made here to start somewhere. If there are more scenarios and use cases that can benefit the community, we (the DPDK community) are free to add the same.

> 
> Again - just my take. And being so delayed, feel free to ignore.
Thanks for understanding. But I will try to accommodate. 

> 
> -
> Shreyansh
  

Patch

diff --git a/doc/guides/howto/debug_troubleshoot_guide.rst b/doc/guides/howto/debug_troubleshoot_guide.rst
new file mode 100644
index 000000000..55589085e
--- /dev/null
+++ b/doc/guides/howto/debug_troubleshoot_guide.rst
@@ -0,0 +1,351 @@ 
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+.. _debug_troubleshoot_via_pmd:
+
+Debug & Troubleshoot guide via PMD
+==================================
+
+DPDK applications can be designed to run as single thread simple stage to
+multiple threads with complex pipeline stages. These application can use poll
+mode devices which helps in offloading CPU cycles. A few models are
+
+  *  single primary
+  *  multiple primary
+  *  single primary single secondary
+  *  single primary multiple secondary
+
+In all the above cases, it is a tedious task to isolate, debug and understand
+odd behaviour which can occurring random or periodic. The goal of guide is to
+share and explore a few commonly seen patterns and behaviour. Then isolate and
+identify the root cause via step by step debug at various processing stages.
+
+Application Overview
+--------------------
+
+Let us take up an example application as reference for explaining issues and
+patterns commonly seen. The sample application in discussion makes use of
+single primary model with various pipeline stages. The application uses PMD
+and libraries such as service cores, mempool, pkt mbuf, event, crypto, QoS
+and eth.
+
+The overview of an application modeled using PMD is shown in
+:numref:`dtg_sample_app_model`.
+
+.. _dtg_sample_app_model:
+
+.. figure:: img/dtg_sample_app_model.*
+
+   Overview of pipeline stage of an application
+
+Bottleneck Analysis
+-------------------
+
+To debug the bottleneck and performance issues the desired application
+is made to run in an environment matching as below
+
+#. Linux 64-bit|32-bit
+#. DPDK PMD and libraries are used
+#. Libraries and PMD are either static or shared. But not both
+#. Machine flag optimizations of gcc or compiler are made constant
+
+Is there mismatch in packet rate (received < send)?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RX Port and associated core :numref:`dtg_rx_rate`.
+
+.. _dtg_rx_rate:
+
+.. figure:: img/dtg_rx_rate.*
+
+   RX send rate compared against Received rate
+
+#. are generic configuration correct?
+    -  What is port Speed, Duplex? rte_eth_link_get()
+    -  Is packet of higher sizes are dropped? rte_eth_get_mtu()
+    -  Are only specific MAC are received? rte_eth_promiscuous_get()
+
+#. are there NIC specific drops?
+    -  Check rte_eth_rx_queue_info_get() for nb_desc, scattered_rx,
+    -  Check rte_eth_dev_stats() for Stats per queue
+    -  Is stats of other queues shows no change via
+       rte_eth_dev_dev_rss_hash_conf_get()
+    -  Check if port offload and queue offload matches.
+
+#. If problem still persists, this might be at RX lcore thread
+    -  Check if RX thread, distributor or event rx adapter is holding or
+       processing more than required
+    -  try using rte_prefetch_non_temporal() to intimate the mbuf in pulled
+       to cache for temporary
+
+
+Are there packet drops (receive|transmit)?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RX-TX Port and associated cores :numref:`dtg_rx_tx_drop`.
+
+.. _dtg_rx_tx_drop:
+
+.. figure:: img/dtg_rx_tx_drop.*
+
+   RX-TX drops
+
+#. at RX
+    -  Get the rx queues by rte_eth_dev_info_get() for nb_rx_queues
+    -  Check for miss, errors, qerros by rte_eth_dev_stats() for imissed,
+       ierrors, q_erros, rx_nombuf, rte_mbuf_ref_count
+
+#. at TX
+    -  Are we doing in bulk to reduce the TX descriptor overhead?
+    -  Check rte_eth_dev_stats() for oerrors, qerros, rte_mbuf_ref_count
+    -  Is the packet multi segmented? Check if port and queue offlaod is set.
+
+Are there object drops in producer point for ring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Producer point for ring :numref:`dtg_producer_ring`.
+
+.. _dtg_producer_ring:
+
+.. figure:: img/dtg_producer_ring.*
+
+   Producer point for Rings
+
+#. Performance for Producer
+    -  Fetch the type of RING 'rte_ring_dump()' for flags (RING_F_SP_ENQ)
+    -  If '(burst enqueue - actual enqueue) > 0' check rte_ring_count() or
+       rte_ring_free_count()
+    -  If 'burst or single enqueue is 0', then there is no more space check
+       rte_ring_full() or not
+
+Are there object drops in consumer point for ring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Consumer point for ring :numref:`dtg_consumer_ring`.
+
+.. _dtg_consumer_ring:
+
+.. figure:: img/dtg_consumer_ring.*
+
+   Consumer point for Rings
+
+#. Performance for Consumer
+    -  Fetch the type of RING – rte_ring_dump() for flags (RING_F_SC_DEQ)
+    -  If '(burst dequeue - actual dequeue) > 0' for rte_ring_free_count()
+    -  If 'burst or single enqueue' always results 0 check the ring is empty
+       via rte_ring_empty()
+
+Is packets or objects are not processed at desired rate?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Memory objects close to NUMA :numref:`dtg_mempool`.
+
+.. _dtg_mempool:
+
+.. figure:: img/dtg_mempool.*
+
+   Memory objects has to be close to device per NUMA
+
+#. Is the performance low?
+    -  Are packets received from multiple NIC? rte_eth_dev_count_all()
+    -  Are NIC interfaces on different socket? use rte_eth_dev_socket_id()
+    -  Is mempool created with right socket? rte_mempool_create() or
+       rte_pktmbuf_pool_create()
+    -  Are we seeing drop on specific socket? It might require more
+       mempool objects; try allocating more objects
+    -  Is there single RX thread for multiple NIC? try having multiple
+       lcore to read from fixed interface or we might be hitting cache
+       limit, so Increase cache_size for pool_create()
+
+#. Are we are still seeing low performance
+    -  Check if sufficient objects in mempool by rte_mempool_avail_count()
+    -  Is failure in some pkt? we might be getting pkts with size > mbuf
+       data size. Check rte_pktmbuf_is_continguous()
+    -  If user pthread is used to access object access
+       rte_mempool_cache_create()
+    -  Try using 1GB huge pages instead of 2MB. If there is difference,
+       try then rte_mem_lock_page() for 2MB pages
+
+.. note::
+  Stall in release of MBUF can be because
+
+  *  Processing pipeline is too heavy
+  *  Number of stages are too many
+  *  TX is not transferred at desired rate
+  *  Multi segment is not offloaded at TX device.
+  *  Application misuse scenarios can be
+      -  not freeing packets
+      -  invalid rte_pktmbuf_refcnt_set
+      -  invalid rte_pktmbuf_prefree_seg
+
+Is there difference in performance for crypto?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Crypto device and PMD :numref:`dtg_crypto`.
+
+.. _dtg_crypto:
+
+.. figure:: img/dtg_crypto.*
+
+   CRYPTO and interaction with PMD device
+
+#. are generic configuration correct?
+    -  Get total crypto devices – rte_cryptodev_count()
+    -  Cross check SW or HW flags are configured properly
+       rte_cryptodev_info_get() for feature_flags
+
+#. If enqueue request > actual enqueue (drops)?
+    -  Is the queue pair setup for proper node
+       rte_cryptodev_queue_pair_setup() for socket_id
+    -  Is the session_pool created from same socket_id as queue pair?
+    -  Is enqueue thread same socket_id?
+    -  rte_cryptodev_stats() for drops err_count for enqueue or dequeue
+    -  Are there multiple threads enqueue or dequeue from same queue pair?
+
+#. If enqueue rate > dequeue rate?
+    -  Is dequeue lcore thread is same socket_id?
+    -  If SW crypto is in use, check if the CRYPTO Library build with
+       right (SIMD) flags Or check if the queue pair using CPU ISA by
+       rte_cryptodev_info_get() for feature_flags for AVX|SSE
+    -  If we are using HW crypto – Is the card on same NUMA socket as
+       queue pair and session pool?
+
+worker functions not giving performance?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Custom worker function :numref:`dtg_distributor_worker`.
+
+.. _dtg_distributor_worker:
+
+.. figure:: img/dtg_distributor_worker.*
+
+   Custom worker function performance drops
+
+#. Performance
+    -  Threads context switches more frequently? Identify lcore with
+       rte_lcore() and lcore index mapping with rte_lcore_index(). Best
+       performance when mapping of thread and core is 1:1.
+    -  Check lcore role type and state? rte_eal_lcore_role for
+       rte, off and service. User function on service core might be
+       sharing timeslots with other functions.
+    -  Check the cpu core? check rte_thread_get_affinity() and
+       rte_eal_get_lcore_state() for run state.
+
+#. Debug
+    -  Mode of operation? rte_eal_get_configuration() for master, fetch
+       lcore|service|numa count, process_type.
+    -  Check lcore run mode? rte_eal_lcore_role() for rte, off, service.
+    -  process details? rte_dump_stack(), rte_dump_registers() and
+       rte_memdump() will give insights.
+
+service functions are not frequent enough?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+service functions on service cores :numref:`dtg_service`.
+
+.. _dtg_service:
+
+.. figure:: img/dtg_service.*
+
+   functions running on service cores
+
+#. Performance
+    -  Get service core count? rte_service_lcore_count() and compare with
+       result of rte_eal_get_configuration()
+    -  Check if registered service is available?
+       rte_service_get_by_name(), rte_service_get_count() and
+       rte_service_get_name()
+    -  Is given service running parallel on multiple lcores?
+       rte_service_probe_capability() and rte_service_map_lcore_get()
+    -  Is service running? rte_service_runstate_get()
+
+#. Debug
+    -  Find how many services are running on specific service lcore by
+       rte_service_lcore_count_services()
+    -  Generic debug via rte_service_dump()
+
+Is there bottleneck in eventdev?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+#. are generic configuration correct?
+    -  Get event_dev devices? rte_event_dev_count()
+    -  Are they created on correct socket_id? - rte_event_dev_socket_id()
+    -  Check if HW or SW capabilities? - rte_event_dev_info_get() for
+       event_qos, queue_all_types, burst_mode, multiple_queue_port,
+       max_event_queue|dequeue_depth
+    -  Is packet stuck in queue? check for stages (event qeueue) where
+       packets are looped back to same or previous stages.
+
+#. Performance drops in enqueue (event count > actual enqueue)?
+    -  Dump the event_dev information? rte_event_dev_dump()
+    -  Check stats for queue and port for eventdev
+    -  Check the inflight, current queue element for enqueue|deqeue
+
+How to debug QoS via TM?
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+TM on TX interface :numref:`dtg_qos_tx`.
+
+.. _dtg_qos_tx:
+
+.. figure:: img/dtg_qos_tx.*
+
+   Traffic Manager just before TX
+
+#. Is configuration right?
+    -  Get current capabilities for DPDK port rte_tm_capabilities_get()
+       for max nodes, level, shaper_private, shaper_shared, sched_n_children
+       and stats_mask
+    -  Check if current leaf are configured identically rte_tm_capabilities_get()
+       for lead_nodes_identicial
+    -  Get leaf nodes for a dpdk port - rte_tn_get_number_of_leaf_node()
+    -  Check level capabilities by rte_tm_level_capabilities_get for n_nodes
+        -  Max, nonleaf_max, leaf_max
+        -  identical, non_identical
+        -  Shaper_private_supported
+        -  Stats_mask
+        -  Cman wred packet|byte supported
+        -  Cman head drop supported
+    -  Check node capabilities by rte_tm_node_capabilities_get for n_nodes
+        -  Shaper_private_supported
+        -  Stats_mask
+        -  Cman wred packet|byte supported
+        -  Cman head drop supported
+    -  Debug via stats - rte_tm_stats_update() and rte_tm_node_stats_read()
+
+Packet is not of right format?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Packet capture before and after processing :numref:`dtg_pdump`.
+
+.. _dtg_pdump:
+
+.. figure:: img/dtg_pdump.*
+
+   Capture points of Traffic at RX-TX
+
+#.  with primary enabling then secondary can access. Copies packets from
+    specific RX or TX queues to secondary process ring buffers.
+
+.. note::
+  Need to explore:
+    *  if secondary shares same interface can we enable from secondary
+       for rx|tx happening on primary
+    *  Specific PMD private data dump the details
+    *  User private data if present, dump the details
+
+How to develop custom code to debug?
+------------------------------------
+
+-  For single process - the debug functionality is to be added in same
+   process
+-  For multiple process - the debug functionality can be added to
+   secondary multi process
+
+.. note::
+
+  Primary's Debug functions invoked via
+    #. Timer call-back
+    #. Service function under service core
+    #. USR1 or USR2 signal handler
+
diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index a642a2be1..9527fa84d 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -18,3 +18,4 @@  HowTo Guides
     virtio_user_as_exceptional_path
     packet_capture_framework
     telemetry
+    debug_troubleshoot_guide