changeset 52:4433e4c6a758

perf-tuning: rewrite of implementation done -- needs to be cut-down now
author Sean Halle <seanhalle@yahoo.com>
date Sun, 03 Jun 2012 15:54:25 -0700
parents b42c7d2948ac
children a358d611a1a7
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 1 files changed, 89 insertions(+), 15 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Sun Jun 03 10:54:02 2012 -0700
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Sun Jun 03 15:54:25 2012 -0700
     1.3 @@ -284,7 +284,7 @@
     1.4  
     1.5  
     1.6  %MOIRAI: MOdel for Integrated Runtime Analysis through Instrumentation
     1.7 -\title{Integrated Performance Tuning Using Semantic Information Collected by Instrumenting the Language Runtime}
     1.8 +\title{Integrated Performance Tuning Using Construct Information Collected by Instrumenting the Language Runtime}
     1.9  
    1.10  \author{
    1.11          Nina Engelhardt\\
    1.12 @@ -701,7 +701,9 @@
    1.13  
    1.14  Each  unit has a life-line, which progresses so:  creation of the meta-unit \pointer , state updates that affect constraints on the unit \pointer,   the decision is made to animate the unit  \pointer, movement of the meta-unit plus data to physical resources that do the animation \pointer  , animation of the unit, which does the work \pointer,  communication of state-update, that unit has completed, and hardware is free \pointer ,  constraint updates within runtime, possibly causing new meta-unit creations or freeing other meta-units to be chosen for animation.  This repeats for each unit. Each step is part of the model.
    1.15  
    1.16 - Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, happening before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs, and once created, a meta-unit may languish before the unit it represents is free to be animated.
    1.17 + Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs, and once created, a meta-unit may languish before the unit it represents is free to be animated.
    1.18 +
    1.19 +Also, note that this explains why the visualizations remain largely the same across languages. The concepts of a meta-unit, a unit, constraints on a unit, and a unit life-line are all valid in every language.  The visualizations are based on these concepts, and so likewise largely remain the same.  In the UCC, only the constraint patterns that represent  the language's constructs change between languages. In the CG, only which construct a line in the CG represents changes.
    1.20  
    1.21  \subsection{Mapping model onto implementation details in runtime}
    1.22  
    1.23 @@ -709,28 +711,102 @@
    1.24  
    1.25  However, the CG is not a strict expression of the model, rather it's purpose is practical. It shows usage of the cores, and relates that to the quantities in the model. Hence, the measurements for the CG all are boundaries, where the core's time switches from one category in the model to a different.
    1.26  
    1.27 -This differs from the model in subtle ways. Most notably, the model declares segments of time where communications take place, while the CG doesn't measure the communication time, rather it captures idleness of the core caused by the non-overlapped portion of that communication. 
    1.28 +This differs from the model in subtle ways. Most notably, the model declares segments of time where communications take place, while the CG doesn't measure the communication time directly, rather it captures idleness of the core caused by the non-overlapped portion of that communication. 
    1.29  
    1.30 -The reason for the difference is that the CG is intended to highlight the quantity of most interest to application coders, which is accounting for core usage and assigning each idle period to a cause. The assignments of units to cores determine the source and destination of communications. Hence, non-overlapped communication idle periods are consequences of the  assignment choices made by the scheduler.  This, by they way, leads to the name: scheduling consequence graph.  
    1.31 +The reason for the difference is that the CG is intended to highlight the quantity of most interest to application coders, which is accounting for core usage and assigning each idle period to a cause. The choice of units to cores is what determines the source and destination of communications. Hence, non-overlapped communication idle periods are consequences of the  assignment choices made by the scheduler.  This, by the way, leads to the name: scheduling consequence graph.  
    1.32  
    1.33   
    1.34  
    1.35  What must be collected during the run differs between the two types of visualization. For the UCC it is unit boundaries and the constraints related to each unit.
    1.36 -For the CG, the same units must  be collected, but also the time a core spends on each segment of the unit's life-line.  However, implementation details of the runtime may cause things such as idling the core during lock acquisition to be counted as time spent on one of a unit's life segments. What core activities go to which life segments changes from runtime to runtime. 
    1.37 +For the CG, the same units must  be collected, but also the time a core spends on each segment of the unit's life-line.  Also, implementation details of the runtime will cause things such as idling the core during lock acquisition to be counted towards a unit's life segment. What core activities go to which life segments changes from runtime to runtime. 
    1.38  
    1.39 -The CG also represents each  cause of a transition from one segment of core usage to another. Such a causation is always a causal dependency of some kind, even ones that correspond to complex constraint constructs in the application. These causations are collected and  tied to one of: construct constraint, runtime internal constraint, or hardware constraint. In this paper, all are collected, but the only causations visualized are construct constraints that cross cores, with propendent on one core and its dependent on another.
    1.40 +The SCG also represents each  cause of a transition from one segment of core usage to another, as an arc between boxes. Such a causation is always a causal dependency of some kind, even if it corresponds to a complex construct in the application. These causations are collected and  tied to one of: construct constraint, runtime internal constraint, or hardware constraint. In this paper, all are collected, but the only causations visualized are constructs  that cross cores, with propendent on one core and its dependent on another.
    1.41  
    1.42  \subsection{Instrumenting our implementation of SSR on top of VMS}
    1.43  
    1.44 -We used a version of SSR implemented on top of a proto-runtime system called VMS.  This proto-runtime embodies most of a runtime implementation, but has replaced two key portions with interfaces. Those portions are the handling of language construct constraints and the decision of which core to assign a unit to. To implement a language, one simply supplies those two portions of code, via the interface.
    1.45 +We instrumented a version of SSR implemented on top of a proto-runtime system called VMS.  This proto-runtime embodies most of a runtime implementation, but has replaced two key portions with interfaces. Those portions are the handling of language construct-constraints and the decision of which core to assign a unit to. To implement a language, one simply supplies those two portions of code, via the interface.
    1.46  
    1.47 -VMS also has the advantage for our approach of being written in accordance with the computation model, which makes instrumenting it especially convenient. 
    1.48 +VMS also has the advantage for our approach of being written in accordance with the computation model, which makes instrumenting it especially convenient. Each language construct has its own handler into which to insert measurement code, and transitions in unit life-lines also have convenient locations in VMS to insert instrumentation code.
    1.49  
    1.50 -Collecting a unit: A unit begins by transitioning out of the runtime to start the unit's code, and ends by leaving the unit's code back to the runtime. These transitions both have a convenient spot in VMS at which to insert the code that records the units. 
    1.51 +\subsubsection{SSR background}
    1.52 +A distinction important in understanding SSR and other parallel languages is being task-based versus virtual processor (VP) based. Task-based languages include dataflow, CnC, and StarSs.  These tasks don't suspend and resume, but rather execute to completion. Hence, such a task is the same as our definition of unit. They have no state that persists across calls to the runtime. In contrast, a virtual processor does suspend and resume and so has state that persists across runtime calls. Examples include pthreads, OpenMP thread-based constructs, UPC, and so on.
    1.53  
    1.54 -Collecting the constraints for UCC: In VMS, each language construct has its own handler. Each handler has code inserted to record which unit invoked the construct. This is how the constraints on units are collected for the UCC.
    1.55 +SSR is based on virtual processors. They execute sequential code that occasionally calls a parallel construct, which suspends the VP and switches to the runtime. This means that each VP contains a sequence of units, with each unit the trace-segment between two SSR library calls.
    1.56 + 
    1.57  
    1.58 -Those two make up all that is needed to gather the UCC. For the CG, 
    1.59 +SSR has both deterministic constraints, which specify the source and destination VP, such as send\_from\_to, and non-deterministic ones, in which the runtime is what chooses which VPs interact, such as send\_to\_of\_type and singleton. Deterministic ones display the same in the UCC and the SCG. However, non-deterministic constraints need all possibilities to be determined for the UCC. Only the actually realized choice is used in the SCG.
    1.60 +
    1.61 +\subsubsection{Collecting a unit}
    1.62 +A unit begins by transitioning out of the runtime and into the unit's application code. It ends by leaving the unit's code back into the runtime. These transitions both have a convenient spot in VMS at which to insert the code that records the units. 
    1.63 +
    1.64 +\subsubsection{Collecting the constraints} In VMS, each language construct has its own handler. Into each handler, code is inserted to record which unit invoked the construct, and any units freed by it. The SCG needs  to link the unit that made a construct call to the unit freed by that call.
    1.65 +
    1.66 +What information needs to be collected for SCG and UCC and how it is done depends on the construct:
    1.67 +\begin{itemize}
    1.68 +\item create\_VP: The creation of a new VP implies a simple dependency: the first unit in the new VP may only execute after the  creating unit ends via the create call. We place code into the create\_VP\ handler, which records  the calling VP\ + unit, along with the newly created unit, and the VP it is assigned to.
    1.69 +\item send\_from\_to and receive\_from\_to: These constructs are rendezvous based, meaning that the unit following the call can only begin after \emph{both} VPs involved have completed the call. We represent this by two crossing dependencies. The constructs are deterministic, so both the UCC and SCG use the same recording. Code is placed into both handlers at the point that checks if both the rendez-vous requests are present. When true, it records both the unit+VPs that connected.
    1.70 +\item Send\_to\_of\_type and receive\_to\_of\_type: These are also rendezvous based, but the pairing of sender and receiver is not deterministic. The SCG   represents the decision made inside the runtime, so inside the handler, the later of the two units connected by the runtime records the pairing.  But for the UCC, we want to capture all sending and receiving permutations available, so we add code that collects the group of senders and the corresponding group of receivers.
    1.71 +\item Singleton: For the singleton construct, there is a single unit with a group of predecessor units and a group of successor units. The first predecessor to complete enables the singleton unit. All successors must wait for completion of the singleton. We insert code into the handler that records the predecessor that enabled the singleton, which is all that the SCG needs. For the UCC, we add code inside the call to start the singleton. It records the unit that the call ended, and the successor-unit it causes to start.
    1.72 +
    1.73 +\end{itemize}
    1.74 +
    1.75 +\subsubsection{Recording time, instructions, and cache misses }
    1.76 + Just recording the units and connections between them is not enough. Because the SCG represents core usage, it needs  the time spent on each activity, including internal runtime activities, to be recorded. Each interval of time is assigned to a  segment of a particular unit's life-line.
    1.77 +
    1.78 +The UCC makes use  of the number of instructions in a unit, as an estimate of size of work in the unit, as illustrated by Fig [UCC same-sz vs UCC instr-sz]. Without knowing the relative size of the units, it is hard to estimate the amount of parallelism \emph{usefully} available in the application.
    1.79 +
    1.80 +To measure the instructions, cycles, and communication (cache misses), we use the hardware performance counters. Readings are inserted into the runtime code to capture core time spent on each segment of the life-line of a unit: 
    1.81 +\begin{enumerate}
    1.82 +\item Create meta-unit: In VMS, this is measuring the time the create\_VP construct handler consumes on the core. 
    1.83 +\item Update constraints: In VMS, this is the time spent inside the language-supplied construct-handler function.
    1.84 +\item Decision to animate: This is the time spent inside the language-supplied assigner function.
    1.85 +\item Move meta-unit to core: This is via shared variables, recorded as part of 3.
    1.86 +\item Move work data to core: This is via cache misses, recorded as part of 6.
    1.87 +\item Do the work of the unit: This is measured by instrumenting the VMS switch-to-unit primitive and the corresponding switch-to-runtime primitive. 
    1.88 +\item Communicate state update: in VMS, this is the time between leaving the application code and starting the construct handler (which includes lock acquisition).
    1.89 +\item Resulting constraint updates: in VMS, this is the time spent inside the construct handler, and is the same as 2 
    1.90 +\end{enumerate}
    1.91 +
    1.92 +In summary, to cover each of the segments of a unit's life-line, code to read  the performance counters is inserted at:
    1.93 +
    1.94 +\begin{itemize}
    1.95 +\item Construct handler: To measure 2 and 8, reading is done before and after VMS calls the language-supplied construct handler function
    1.96 +\item
    1.97 +Assigner: To measure 3 and 4, reading is done before and after VMS calls  the language-supplied assigner function.
    1.98 +\item Work: To measure 5 and 6, reading is done at the point VMS switches to the unit, and the point it switches back into the runtime.
    1.99 +
   1.100 +\item 
   1.101 +Dual-use: To measure 1, the construct handler's reading points are used. To measure 7, the reading done upon switching into runtime is coupled to the reading done just before starting the construct handler function.
   1.102 +\end{itemize}
   1.103 +
   1.104 +  
   1.105 +
   1.106 +For clarity, all but work are grouped as overhead in the visualization, but they could be displayed separately if needed.
   1.107 +
   1.108 + All the measurements are output into a trace file, which is then evaluated after the run to build the visualizations.
   1.109 +
   1.110 +\subsection{Building the Visualizations}
   1.111 +
   1.112 +Both the UCC and the SCG are represented as directed graphs, with units as nodes.
   1.113 +
   1.114 +\subsubsection{UCC}
   1.115 +For the UCC, units can be either unweighted or weighted. Weighted units appear as rectangles with height proportional to the weight, unweighted units are circles. Our implementation can use the number of instructions in the work section from a run to weigh the units. This removes some of the influence of scheduling and data, such as cache misses, but can be insufficient if the application is strongly data- or scheduling-dependent.
   1.116 +
   1.117 +Simple, deterministic dependencies are represented as arcs. Complicated constraints are for now displayed as an additional node bearing information on the constraint with incoming arcs from all units whose execution status affects the constraint and outgoing arcs to the constrained units.
   1.118 +
   1.119 +A critical path algorithm is then used to place the nodes vertically, from top to bottom. For non-deterministic constraints, it is possible to enable or disable their participation in the path. Enabling them will lead to an over-estimation of the critical path, disabling them to an under-estimation (better solutions welcome). It is also ensured that no nodes overlap.
   1.120 +
   1.121 +\subsubsection{SCG}
   1.122 +
   1.123 +For the SCG, all nodes are weighted with the number of cycles spent on the unit in total (work + overhead). Nodes are then further displayed separated into overhead and work. Because it displays a concrete run, the actual choices made for all non-deterministic constraints are available, so all dependencies are deterministic. The same critical path algorithm as for the UCC is used to place nodes vertically, but this time horizontal placement is determined by the core on which the unit was executed  (hardware dependencies ensure no overlap).
   1.124 +
   1.125 +A selection of constraints can then be overlaid, with color codes by type. In SSR we display by default creation, direct and typed message sending (but not the crossing dependency from the receiver back to the sender), and singleton outgoing dependencies; but each type can be individually hidden or shown.
   1.126 +
   1.127 +All this information is taken purely from the runtime, leading to a rich, configurable visualization without needing to add anything to the application.
   1.128 +
   1.129 +\section{Conclusion}
   1.130 +\label{conclusion}
   1.131 +We have shown how to apply a computation model to instrument a language runtime for collecting measurements that connect: each measurement to others, to application structure, to scheduling decisions, and to hardware. A simple visualization of the data has features that indicate lost performance, and features that visually link the lost performance to the cause, no matter if the cause is application structure, language runtime implementation, or hardware feature.  It is this linkage, due to the computation model, that sets this approach apart from others. 
   1.132  
   1.133  ===============  
   1.134  
   1.135 @@ -783,9 +859,7 @@
   1.136  
   1.137  ?
   1.138  
   1.139 --] Highlight Benefit: model is independent of language -- it is based on life-line of a unit.  Concept of unit goes across languages, and they all have a life-line.  visualization is in terms of units and constraints, which are universal, so it remains same, across languages.  Only change is which construct a line in CG belongs to, and the UCC shape of constraint constructs. 
   1.140  
   1.141 -?
   1.142  
   1.143  Output format:   for implementer
   1.144  
   1.145 @@ -836,7 +910,7 @@
   1.146  
   1.147   The runtime checks the constraints defined by the language, performs some operations (e.g. transmitting a message) if required, and then calls the scheduler to decide upon the next VP to be run. With our definition of units as bounded by scheduling decisions, this means that each VP contains a sequence of units, with each unit the trace-segment between two SSR library calls. 
   1.148  
   1.149 -There is always a ``dependency'' (constraint that only frees the dependent after the propendent finished) between two successive tasks inside a same VP, since they are part of a continuous sequential code. This is a deterministic constraint, so it is the same in the UCC and the SCG. For non-deterministic constraints, such as singleton and send\_of\_type, all possibilities need to be added to the UCC. But only the actually realized causal dependencies are used in the SCG.
   1.150 +There is always a ``dependency'' (constraint that only frees the dependent after the propendent finished) between two successive tasks inside a same VP, since they are part of a continuous sequential code. This is a deterministic constraint, so it is the same in the UCC and the SCG. For non-deterministic constraints, such as singleton and send\_of\_type, all possibilities need to be added to the UCC. But only the actually realized choice is used in the SCG.
   1.151  
   1.152  Additional constraints can be placed on the tasks depending on the construct:
   1.153  \begin{itemize}
   1.154 @@ -857,7 +931,7 @@
   1.155  \item Assign: Deciding which unit to execute next
   1.156  \item Work: Doing the work of the unit
   1.157  \item Invocation: Switching back from the work to the runtime
   1.158 -\item Request handler: Performing actions requested by the application (e.g. transmitting a message) and then updating constraint status
   1.159 +\item Construct handler: Performing actions requested by the application and  updating constraint status
   1.160  \end{itemize}
   1.161  For clarity, all but work are grouped as overhead in the visualization, but they could be displayed separately if needed.
   1.162