changeset 41:ae51818f61c8

Perf Tuning -- first full draft of theory section at end
author Some Random Person <seanhalle@yahoo.com>
date Fri, 04 May 2012 06:40:58 -0700
parents cbefaa3eda37
children a4e0504b60f6
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 1 files changed, 32 insertions(+), 26 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri May 04 05:26:00 2012 -0700
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri May 04 06:40:58 2012 -0700
     1.3 @@ -312,55 +312,43 @@
     1.4  
     1.5  \section{Introduction and Motivation}
     1.6  
     1.7 -Performance visualizations and tuning tools for parallel programs leave something to be desired, and yet are critical to achieving good performance. They fail mainly because their displays of measurements  leave the user to \emph{guess} the \emph{cause} of performance loss.  They    provide many different views of performance measurements,  like statistics by line of code, message sends and receives, and core usage timelines. But the user doesn't know why a function runs in a particular spot on a particular core, nor whether that  is desired behavior vs erroneous behavior.
     1.8 +Performance visualizations and tuning tools for parallel programs leave something to be desired, and yet are critical to achieving good performance. They fail mainly because their displays of measurements  leave the user to \emph{guess} the cause of performance loss.  They    provide many different views of performance measurements,  like statistics by line of code, message sends and receives, and core usage timelines. But the user doesn't know why a function runs in a particular spot on a particular core, nor whether that  is desired behavior vs erroneous behavior.
     1.9  
    1.10  
    1.11  It's like the allegory of the five blind people and the elephant: one touches the ear and says it's round and flat, another touches the leg and says it's a tree, and so on. They each are correct, but the views don't connect to tell them the whole picture. 
    1.12  
    1.13 -It is the parallel aspects of code and runtime decisions that the tools fall short on.
    1.14 -The user usually knows what the application is doing semantically, but parallel performance is determined by scheduling choices and other runtime behavior.  The decision about which task or virtual processor is assigned to which core at what point in time is the heart of parallel performance. That is the information missing: the causes of that behavior, which the user wants to know.
    1.15 +Current tools cover all the parts of the application code, but fail to adequately connect these to the runtime behavior, scheduling decisions, and consequent hardware behavior.  The decision about which task or virtual processor is assigned to which core at what point in time is the heart of parallel performance, so the tool needs to connect such choices and behaviors to application features. That is the information missing: the causes of the observed behavior, and how those causes tie to application features.
    1.16  
    1.17  
    1.18 -To fix this, a framework is needed to connect the measurements together.  It  should be in terms of scheduling decisions, with the units the decisions are made on, and the various sources of constraints on those decisions. 
     1.19 +To fix this, a model is needed that provides the linkage to connect the measurements together.  It should be expressed in terms of scheduling decisions, the units those decisions are made on, and the various sources of constraints on those decisions. 
    1.20  
    1.21 -Our contribution is such a framework. Using it, we generate views that indicate the schedulable units in the code. and all the constraints on scheduling them. These include constraints imposed by the application, scheduler implementation  and hardware details. 
     1.22 +Our contribution is such a model and framework. Using it, we generate views that indicate the schedulable units in the code, and all the constraints on scheduling them. These include constraints imposed by the application, the runtime implementation, and hardware details. 
    1.23  
    1.24 -The views connect the units to specific segments of code that compose the units, and connect each constraint on scheduling choice to the precise source of the constraint, within the code, hardware, or runtime. They should also separate resource usage into categories of: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. They should also integrate parameter choices within the code that affect unit creation and constraints.
     1.25 +The views connect the units to the specific segments of code that compose them, and connect each constraint on scheduling choice to the precise source of the constraint, within the code, hardware, or runtime. They separate resource usage into the categories: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. They also integrate the parameter choices within the code that affect unit creation and constraints.
    1.26  
    1.27 -=====
    1.28   
    1.29 - The value of a consequence graph is linking the size of boxes in it to the decisions made by the scheduler, which are represented by the shape. This lets a person visually link scheduling decisions to consequences.  The mind can quickly see empty areas, representing lost performance, and connect those to boxes and lines that cause the empty areas.  They then look up the code of each box and construct of each line, which gives them the code structures that caused the loss of performance.  This \emph{visual} linkage of undesirable effect back to \emph{cause in the code} is what sets our approach apart. 
    1.30 -The value of the UCC is visually linking cause of performance loss to options available for fixing it.
    1.31  
    1.32 -=====  
    1.33  
     1.34 +We describe our model of computation, which relates the aspects of performance to one another, along with the instrumentation and visualization that are guided by the model and that collect performance tuning information and link it together. The model and visualization are illustrated with a story line, which shows how they are used to performance tune a standard parallel application on a challenging multi-core system. 
    1.35  
    1.36 -We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance tuning information together. The model and visualization tools are illustrated with a story line, which shows how they are used to performance tune a standard parallel application on two multi-core systems. 
     1.37 +We start with background on performance tuning and previous approaches to tools in Section X. Then we present the model, the instrumentation and visualization guided by it, and a story line that illustrates their use.
    1.38  
    1.39  
    1.40  \section{Background and Related Work}
    1.41 -Performance tuning has steps that are iterated until the person tuning is satisfied. First take measurements and display them, in order to discover discrepancies from desired behavior. Next, connect the details of the discrepancies with structure information to form a hypothesis for the cause of the discrepancy. The cause should suggest a plan to fix the problem. Then implement the plan, re-execute, gather new measurements, and repeat until satisfied.
     1.42 +Performance tuning follows the standard iterative approach. In each iteration, measurements of performance are first taken and compared to what was desired. A mental model is used to generate a hypothesis for why the measurement fell short. The mental model is then used to link the hypothesis to things within the programmer's control, suggesting a change to make to the code. The modified code is run again, and the iteration repeats until the person doing the tuning is satisfied.
    1.43  
    1.44 -For parallel performance, the information presented must include task/work-unit, constraints on scheduling those, and scheduling decisions actually made.
    1.45 + 
    1.46  
    1.47 -Most of the older more established tools come from the threads world, and conceive of the application as a processor that performs actions, but don't include the concept of application-defined tasks nor constraints on them. For example, 
     1.48 +For parallel performance, more than just runtime measurements are important. Context for those measurements must also be gathered, which includes the characteristics of the units of work that are scheduled onto resources, as well as the constraints on scheduling them. Then, during the run, not only is hardware usage measured, but also the scheduling decisions actually made and carried out.
    1.49  
     1.50 +Most of the older, more established tools come from the threads world; they conceive of the application as a processor that performs actions, but do not include the concept of application-defined tasks, nor constraints on them. This makes them unable to directly connect the statistics they gather to application features.  The lack of connection forces the user to guess which aspect of the code is responsible for the observed performance.
    1.51  
    1.52 -Their structure information is either call graph or line of code. This is too limited for parallel performance. Need tasks and scheduling. More recent languages move towards task-based. This makes more of the needed structure information available. But we've not seen a complete or coherent presentation.
    1.53  
    1.54 -===========
     1.55 +For example, Tau is a highly cited older system for performance tuning parallel applications, representative of thread-centric approaches. It integrates many data sources and has rich displays. However, its model was cores, memories, and thread contexts, with actions taken on or by each. It had no well-defined concept of scheduling, of the units scheduled, nor of constraints on scheduling those units. Hence, it had no view that integrated the parallelism-specific information at the heart of performance: tasks, constraints on them, and scheduling choices.
    1.56  
    1.57 -We've had basic perf tuning, then Tau, now 
     1.58 +Another highly cited classic performance tuning system is Paradyn[]. It is meant for applications that run for several days on multi-thousand-node clusters. Its model of computation is based on events, both the timing of events and counts of events. It has a system for user-supplied instrumentation to collect event information, and it has a hypothesis mechanism that protects the user from having to write custom code to test their hypotheses. However, the hypotheses are in terms of the timing and counts of events, not the parallelism-relevant information of units of scheduled work and the scheduling decisions made on those. 
    1.59  
    1.60 -A survey of the most highly cited classic papers shows the commonality..
    1.61  
    1.62 -===========
    1.63 -
    1.64 -One of the most cited examples of the classic performance tuning systems is  Tau[]. (Threads, thousands of data sources, no coherent inter-relation of what's happening) It integrates many data sources, with rich displays. However its model was cores and memories and contexts, with actions taken on or by each. It had no well defined concept of scheduling, runtime nor units of work (tasks). Hence, it had no view that integrated the parallelism-specific information about tasks, constraints on them, and scheduling choices.
    1.65 -
    1.66 -Another highly cited classic performance tuning system is Paradyn[]. It is meant for applications that run for several days on multi-thousand node clusters. Its model of computation is based on events, both the timing of events and counts of events. It has a system for user-supplied instrumentation to collect event information, and, it has a hypotheses mechanism that protects the user from having to write custom code to test their hypotheses. However, the hypotheses are in terms of the timing and counts of events. not the parallel computation relevant information of units of scheduled work and the scheduling decisions made on those. (an attempt at codifying this hypothesis and test approach to perf tuning). 
    1.67 -
    1.68 -The second most cited is an overview paper from 1991.
    1.69  
    1.70  Paragraph instruments the MPI library, so its model is also event based; for each core it only tells whether the core is busy, incurring communication-specific overhead, or idle.
    1.71  
    1.72 @@ -385,6 +373,23 @@
    1.73  
    1.74  =======================?
    1.75  
     1.76 +The structure information these tools provide is either a call graph or per-line-of-code statistics. This is too limited for parallel performance, which requires information about tasks and scheduling. More recent languages move toward task-based models, which makes more of the needed structure information available, but we have not seen a complete or coherent presentation of it.
    1.77 +
    1.78 +=====
    1.79 + 
     1.80 + The value of a consequence graph is that it links the size of the boxes in it to the decisions made by the scheduler, which are represented by the shape. This lets a person visually link scheduling decisions to consequences.  The mind can quickly see empty areas, representing lost performance, and connect those to the boxes and lines that cause them.  The user then looks up the code of each box and the construct behind each line, which gives the code structures that caused the loss of performance.  This \emph{visual} linkage of undesirable effect back to \emph{cause in the code} is what sets our approach apart. 
     1.81 +The value of the UCC is visually linking the cause of performance loss to the options available for fixing it.
    1.82 +
    1.83 +===== 
    1.84 +
    1.85 +
    1.86 +We've had basic perf tuning, then Tau, now 
    1.87 +
     1.88 +A survey of the most highly cited classic papers shows the commonality.
    1.89 +
    1.90 +===========
    1.91 +
    1.92 +
     1.93 +Use case: creation takes longer when there are more units. That is because the code mallocs all the matrix pieces before it pairs any and starts the pair as work. This can be seen from the graphs by linking the units to the code they execute -- not to the functions specifically, but to the scheduled code snippets -- which suggests changing the way the divider works.
    1.94  
     1.95  The graph tells us the division code is bad: the program works better with 39 cores than with 40, because it makes more pieces for 39 than for 40, and they end up fitting better onto the cores.
    1.96 @@ -522,7 +527,7 @@
    1.97  
    1.98  \begin{figure}[b]
    1.99    \begin{minipage}[b]{0.2\textwidth}
   1.100 - 	\subfloat[Original]
   1.101 +        \subfloat[Original]
   1.102      {\quad\quad \includegraphics[scale=0.015]{../figures/192.pdf} \quad
   1.103      \label{story:a}}\quad
   1.104    \end{minipage}
   1.105 @@ -825,3 +830,4 @@
   1.106  %% EOF ieeepes_skel.tex
   1.107  %%----------------------------------------------------------------------
   1.108  
    1.109 + destroying virtual processors, and three kinds of send-receive pairs. The 
   1.110 \ No newline at end of file