Mercurial > cgi-bin > hgwebdir.cgi > VMS > 0__Writings > kshalle

changeset 50:9f835e24cb7a
perf tune: implementation
author: Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date: Fri, 01 Jun 2012 19:45:35 +0200
parents: 9a695032203b
children: b42c7d2948ac
files: 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat: 2 files changed, 47 insertions(+), 9 deletions(-) [+]
[-]

0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0

0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 56 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 56
0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0
0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 56
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Jun 01 04:47:58 2012 -0700
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Jun 01 19:45:35 2012 +0200
     2.3 @@ -666,24 +666,62 @@
     2.4  The last question is how to handle communication consequences. This is tricky because decisions in higher-level runtimes set the context for decisions in lower-level ones. This means a higher-level choice is linked to the consequences from lower-level choices. The value of a consequence graph is due to linking the size of boxes in it to the decisions made by the scheduler, as represented by the shape. It's not clear how to divide, among the levels, the time that cores spend waiting for non-overlapped communication. We have no good answer at the moment and leave it for future work.
     2.5  
     2.6  \section{Implementation}
     2.7 -%%how are graphs generated, what is needed
     2.8  
     2.9  The visualization relies on data collected from the runtime during execution. There are two kinds of information that need to be recorded: identification of units and constraints, and execution metric measurements.
    2.10  The first can be obtained from the language runtime at the places where constraints are checked and modified and units are created. The second have to be recorded as the unit progresses through different stages of execution.
    2.11  
    2.12 +\subsection{Units and Constraints}
    2.13 +
    2.14  As units are defined by scheduling decisions, the creation of a unit is easiest to register at the point where the unit is assigned to a processing element. This ensures that all units that are executed are recorded, and all units that are recorded are really executed. There is no significant variation in this between languages, and the units are the same for the concrete UCC and the SCG. If the language captures sufficient information to reconstruct an abstract UCC (possibly only for simple types of UCC), this information can also be captured, allowing more general analysis.
    2.15  Language constructs specify constraints on units. The connections between units can be very complex depending on the language, so the instrumentation needs to be tailored to the constructs.
    2.16 -In SSR, we have several constructs, all of which simultaneously mark boundaries between tasks:
    2.17 -\begin{description}
    2.18 -\item[Create VP] The creation of a new VP creates a simple dependency: the first task in the new VP may only execute after the creating task has finished.
    2.19 -\item[Simple send and receive] Send to and receive from a specific VP is rendez-vous based, so that the units following the communication in both VPs can only execute after the units preceding the rendevouz point in both VPs have finished. This can easily be represented by two crossing dependencies. These are deterministic, so the record is the same for the UCC or the SCG.
    2.20 -\item[Typed send and receive] Typed send/receive is also rendez-vous based, but contrary to simple send/receive, the pairing of sender and receiver is not deterministic. For the SCG, which represents a specific run, the actual communications observed can be recorded in the same way as simple send/receive, but for the UCC, we want to capture all sending and receiving permutations available. In this case, since the construct specifies no further constraints beyond the type of the message, we simply record for each type a group of senders and a group of receivers.
    2.21 -\item[Singleton] For the singleton construct, the units and constraints are actually deterministic, only the assignment of the unit to a VP is decided dynamically.
    2.22 -\end{description}
    2.23 -Additionally, the virtual processor structure means that tasks inside a VP are sequentially dependent, i.e. the things happening in a thread after it gets suspended can only happen after the things that happened before it got suspended.
    2.24  
    2.25 +Using the concrete case of SSR, we will show the correspondence between the language's constructs and the elements of the UCC and the SCG.
    2.26  
    2.27 +SSR, like threads, is based on the concept of virtual processors (VPs). These VPs execute sequential code that occasionally makes use of parallel language constructs, which will suspend the VP and switch to the runtime. The runtime checks the constraints defined by the language, performs some operations (e.g. transmitting a message) if required, and then calls the scheduler to decide upon the next VP to be run. With our definition of units as bounded by scheduling decisions, this means that each VP contains a sequence of units, with each unit the trace-segment between two SSR library calls. There is always a ``dependency'' (constraint that only frees the dependent after the propendent finished) between two successive tasks inside a same VP, since they are part of a continuous sequential code. This is a deterministic constraint, so it is the same in the UCC and the SCG. For non-deterministic constraints, all possibilities need to be added to the UCC, whereas only the actually realized causal dependencies are used to build the SCG.
    2.28  
    2.29 +Additional constraints can be placed on the tasks depending on the construct:
    2.30 +\begin{itemize}
    2.31 +\item Create VP: The creation of a new VP creates a simple dependency: the first task in the new VP may only execute after the creating task has finished.
    2.32 +\item Simple send and receive: Send to and receive from a specific VP is rendezvous based, so that the units following the communication in both VPs can only execute after the units preceding the rendezvous point in both VPs have finished. This can easily be represented by two crossing dependencies. These are deterministic, so the record is the same for the UCC or the SCG.
    2.33 +\item Typed send and receive: Typed send/receive is also rendezvous based, but contrary to simple send/receive, the pairing of sender and receiver is not deterministic. For the SCG, which represents a specific run, the actual communications observed can be recorded in the same way as simple send/receive, but for the UCC, we want to capture all sending and receiving permutations available. In this case, since the construct specifies no further constraints beyond the type of the message, we simply record for each type a group of senders and a group of receivers.
    2.34 +\item Singleton: For the singleton construct, there is a group of predecessors, the completion of one of which is sufficient to fulfil the constraint on the singleton unit. All successors have to wait for the completion of the singleton, so these are simple dependencies.
    2.35 +\end{itemize}
    2.36 +
    2.37 +\subsection{Execution measurements}
    2.38 +
    2.39 +In addition to identifying the units and the relations between them, we need to capture how much work a unit represents and how many resources (mainly core-time) it occupies.
    2.40 +
    2.41 +For performance tuning it is obviously critical to know the actual performance, but even for the UCC these measurements are important. Indeed, as evidenced by Fig [UCC with identical-size units vs UCC with units sized by instruction count], without knowing the relative size of the units, it is hard so estimate the amount of parallelism in the application.
    2.42 +
    2.43 +We use hardware performance counters to measure for each unit the work (=time spent in application code) and the overhead (=time spent in runtime) in cycles and in instructions, as well as the number of last level cache misses during the work. The counter values are retrieved in the runtime before and after several sections in the lifetime of a unit: 
    2.44 +\begin{itemize}
    2.45 +\item Assign: Deciding which unit to execute next
    2.46 +\item Work: Doing the work of the unit
    2.47 +\item Invocation: Switching back from the work to the runtime
    2.48 +\item Request handler: Performing actions requested by the application (e.g. transmitting a message) and then updating constraint status
    2.49 +\end{itemize}
    2.50 +For clarity, all but work are grouped as overhead in the visualization, but they could be displayed separately if needed.
    2.51 +
    2.52 +We also record which core a unit was executed on. All these measurements are directly output into a trace file, which is then evaluated after the run to build the visualizations.
    2.53 +
    2.54 +\subsection{Building the Visualizations}
    2.55 +
    2.56 +Both the UCC and the SCG are represented as directed graphs, with units as nodes.
    2.57 +
    2.58 +\subsubsection{UCC}
    2.59 +For the UCC, units can be either unweighted or weighted, if an estimation of the amount of work is available. Weighted units appear as rectangles with height proportional to the weight, unweighted units are circles. Our implementation can use the number of instructions in the work section from a run to weigh the units. This removes some of the influence of scheduling and data, such as cache misses, but can be insufficient if the application is strongly data- or scheduling-dependent.
    2.60 +
    2.61 +Simple, deterministic dependencies are represented as arcs. Complicated constraints are for now displayed as an additional node bearing information on the constraint with incoming arcs from all units whose execution status affects the constraint and outgoing arcs to the constrained units.
    2.62 +
    2.63 +A critical path algorithm is then used to place the nodes vertically, from top to bottom. For non-deterministic constraints, it is possible to enable or disable their participation in the path. Enabling them will lead to an over-estimation of the critical path, disabling them to an under-estimation. (Better solutions welcome) Horizontally, it is simply ensured that no nodes overlap.
    2.64 +
    2.65 +\subsubsection{SCG}
    2.66 +
    2.67 +For the SCG, all nodes are weighted with the number of cycles spent on the unit in total (work + overhead). Nodes are then further displayed separated into overhead and work. Because it displays a concrete run, the actual choices made for all non-deterministic constraints are available, so all dependencies are deterministic. The same critical path algorithm as for the UCC is used to place nodes vertically, but this time horizontal placement is determined by the core on which the unit was executed  (hardware dependencies ensure no overlap).
    2.68 +
    2.69 +A selection of constraints can then be overlaid, with color codes by type. In SSR we display by default creation, direct and typed message sending (but not the crossing dependency from the receiver to the sender), and singleton outgoing dependencies; but each type can be individually hidden or shown.
    2.70 +
    2.71 +All this information is taken purely from the runtime, leading to a rich, configurable visualization without needing to add anything to the application.
    2.72  
    2.73  \section{Conclusion}
    2.74  \label{conclusion}