changeset 49:9a695032203b

Perf Tuning -- Cleaned up walk-through and model description
author Sean Halle <seanhalle@yahoo.com>
date Fri, 01 Jun 2012 04:47:58 -0700
parents f184ed659caa
children 9f835e24cb7a
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 1 files changed, 65 insertions(+), 40 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed May 30 17:31:19 2012 +0200
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Jun 01 04:47:58 2012 -0700
     1.3 @@ -427,13 +427,23 @@
     1.4  
     1.5  In this section, we show our approach being used during a typical performance tuning session, to see how its features give benefit, and how competing tools' lack of those features makes the work more difficult.
     1.6  
     1.7 -\subsection{Setup}
     1.8 +To prepare, we describe the program and language used, and then the details of a generated visualization. After this we show a sequence of the visualizations. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis for the cause of the loss.  
     1.9  
    1.10 -In our session, we wish to tune a standard program that the reader has likely already experienced attempting to performance tune, and/or knows well. The best example is likely matrix multiply, with which the reader should be familiar, allowing concentration on the tool without distraction about the application. We run it on a machine with 4 sockets by 10 cores each, for a total of 40 physical cores.
    1.11 +\subsection{The Application, and Machine Run On}
    1.12  
    1.13 -The application code includes a function that automatically divides the work into a number of tasks, based on the number of cores and a tuning parameter. It distributes the tasks across the cores in a round-robin fashion, and then waits for completion of the calculation before initiating shutdown. The answers produced by the tasks are sent to a results collector and accumulated into the result matrix. When done, it notifies the setup function of completion.
    1.14 +In our session, we wish to tune a standard program that the reader has likely already experienced attempting to performance tune, and/or knows well. The best example is likely matrix multiply, with which the reader should be familiar, allowing concentration on the tool without distraction about the application. 
    1.15  
    1.16 -The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has commands for creating and destroying virtutal processors, and three kinds of send-receive paired operations. The first, \emph{send\_from\_to} specifies a specific sender and specific receiver. We use it to notify the VP dedicated to dividing the work (``creator VP'') that the VP collecting results (``results VP'') has received all sub-results, so it can output the final result. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This construct is used by the processors multiplying sub-matrices, to send their sub-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application.
    1.17 +We run it on a machine with 4 sockets by 10 cores each, for a total of 40 physical cores. They are Intel WestmereEx cores running at 3.0GHz, with TurboBoost turned off. 
    1.18 +
     1.19 +The application code uses language features to create a number of virtual processors (VPs). The first VP created runs a set of functions that divides the work into a number of pieces and creates a new VP for each piece. 
    1.20 +
     1.21 +The number of pieces is set by a tuning parameter in the code, together with the number of cores. The code uses language features to distribute the VPs across the cores, initially in a round-robin fashion.
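The round-robin placement just described can be sketched as a one-line mapping. This is purely illustrative; the function name is hypothetical, and SSR's actual distribution code is not shown in the paper.

```c
#include <assert.h>

/* Illustrative sketch (not SSR code): round-robin assignment of work
 * pieces to cores. The number of pieces comes from the tuning parameter
 * and the core count, as described in the text. */
static int core_for_piece(int piece, int num_cores) {
    return piece % num_cores;   /* piece i goes to core i mod num_cores */
}
```

On the 40-core machine used in the session, piece 0 and piece 40 would both land on core 0.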
    1.22 +
     1.23 +It then creates a results VP that receives a partial-result from each piece and accumulates them. The original divider VP then waits for the results VP to indicate completion, after which the language runtime shuts down.
    1.24 +
    1.25 +The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has commands for creating and destroying VPs, and three kinds of send-receive paired operations. 
    1.26 +
     1.27 +The first, \emph{send\_from\_to}, specifies a specific sender and specific receiver. It is used by the results VP to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This is used by the VPs working on the pieces to send their partial-result to the results VP. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; it is not used in our application.
    1.28  
    1.29   The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to  rearrange and copy data to get better cache behavior. A given copy is shared by several virtual processors on different cores, but the copy only needs to be performed once. 
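The singleton's run-exactly-once guarantee is similar in spirit to one-time initialization (e.g. POSIX \texttt{pthread\_once}). The following single-threaded C sketch shows only the shape of the idea; the guard flag and function names are invented here, and SSR's real singleton must of course be safe across VPs on different cores, which this sketch is not.

```c
#include <assert.h>

/* Hypothetical stand-in for the copy-transpose that rearranges data
 * for better cache behavior. */
static int copies_performed = 0;
static void copy_transpose(void) { copies_performed++; }

/* Single-threaded sketch of a singleton: every VP sharing the copy
 * calls this, but the enclosed work runs exactly once. A real
 * implementation needs thread-safe one-time init (cf. pthread_once). */
static int singleton_done = 0;
static void enter_singleton(void) {
    if (!singleton_done) {
        singleton_done = 1;
        copy_transpose();
    }
}
```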
    1.30  
    1.31 @@ -444,18 +454,23 @@
    1.32  
    1.33  
    1.34  \subsection{The Visualizations}
    1.35 - The ffirst is what we refer to as a scheduling consequence graph. It depicts all the scheduling operations performed by the runtime, along with the consequent usage of the cores. 
     1.36 + The first visualization is what we refer to as a scheduling consequence graph, or just consequence graph (CG). It depicts each of the scheduling operations performed by the runtime, and the consequent usage of the cores. 
    1.37  
    1.38 -The second visualization depicts the application-derived constraints on the scheduling decisions. These limit the choices the runtime is allowed to make. We call this the Unit \& Constraint Collection, or UCC. 
    1.39 +The second visualization depicts constraints on those scheduling decisions that come from the application, such as dependencies in the code. These limit the choices the runtime is allowed to make. We call this the Unit \& Constraint Collection, or UCC. 
    1.40  
    1.41  The UCC shows only application-derived information, as opposed to the consequence graph, which combines the \textit{use} of the UCC-depicted constraints with runtime-imposed dependencies and hardware-imposed constraints.  Hence, the UCC states the degrees of freedom enabled by the application, while the consequence graph states how those were made use of, by a particular runtime on particular hardware.
    1.42  
    1.43 -Returning to Fig X, each column is associated with one core. 
    1.44 -A blue-violet vertical block in a column represents the time the core spends doing the actual work of one work-unit, the height being proportional to the number of cycles taken for execution, and the color indicating  intensity of cache misses. Pure red indicates that work-unit had the most cache misses, pure blue indicates the least.
     1.45 +Fig X shows a consequence graph (CG), stylized for purposes of explanation. It is composed of a number of columns, one for each core. A column represents time, starting at the top and increasing downward, measured in clock cycles. It is broken into blocks, each representing the time accounted to one work-unit. Each block is further divided into regions, each a different color that indicates the kind of activity the core was engaged in during that region's time-span.  
    1.46 +  
     1.47 +The kinds of activities are defined by the computation model that underlies the visualization. The first is the work of a work-unit, represented by a blue-to-red region. It includes time stalled due to cache misses. The color indicates the intensity of cache misses, with pure red representing at or above the maximum misses per instruction, and pure blue at or below the minimum. The max and min are set in the tool that generates the visualization.
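The blue-to-red coloring can be read as an interpolation on misses per instruction, clamped at the tool's configured min and max. The exact mapping the tool uses is not given in the text, so the linear scheme below is an assumption for illustration.

```c
#include <assert.h>

/* Assumed linear color mapping (illustrative, not the tool's actual
 * formula): misses-per-instruction in [min_mpi, max_mpi] maps to a
 * fraction 0.0 (pure blue) .. 1.0 (pure red). */
static double redness(double mpi, double min_mpi, double max_mpi) {
    if (mpi <= min_mpi) return 0.0;   /* at or below min: pure blue */
    if (mpi >= max_mpi) return 1.0;   /* at or above max: pure red  */
    return (mpi - min_mpi) / (max_mpi - min_mpi);
}
```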
    1.48  
    1.49 -  Just above each of these, in gray, is the runtime overhead spent on that work-unit, which can be broken into pieces representing acquisition of the lock on the shared semantic state, time spent performing the semantics of the parallelism construct, time spent deciding which ready task to execute next, and time spent switching from virtual processor, to the runtime, and back. 
     1.50 +  The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the viewer, it is further broken into pieces representing activities inside the runtime. These may include acquisition of a lock on shared semantic state, time spent on constraints determining readiness of the work-unit, time spent deciding which ready unit to assign to which hardware, and time spent switching from the virtual processor to the runtime and back. In this paper, we show all runtime overhead the same way; however, in other circumstances a breakdown can be key to understanding the interaction between runtime and application. 
    1.51   
    1.52  
     1.53 +The other type of visual feature seen in Fig X is lines. Each represents a construct that influenced scheduling. A line depicts two things, both inside the runtime: a constraint and a decision. The constraint was satisfied, which made the decision possible; the decision chose which core to do the work on. 
    1.54 +
    1.55 +In general, other kinds of lines may also be drawn, representing other kinds of interactions that affect core usage. For example,  visualization can be turned on for  the internal runtime constraint that only one core at a time may access shared constraint and scheduling state. This appears as additional lines linking the gray runtime regions of blocks. In this paper, visualization is turned off for such minor interactions. 
    1.56 +
     1.57  Successive work-unit blocks that go in sequence often have a causal dependency between them, due to the semantics of the base language.
    1.58   
    1.59  Many different orderings could also have been validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: the application code constraints, hardware constraints, and runtime implementation imposed constraints. 
    1.60 @@ -466,24 +481,29 @@
    1.61  
    1.62  \subsection{Walk-through}
    1.63  
    1.64 -After functional debugging, the first run produces the visualizations seen in Figures X and X. The first thing to notice, then, is that the first picture is slimmer than expected: of the 40 available cores, only 13 were being used. As the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection revealed that a closing bracket in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as test program for development of the language runtime, including performance, for several months.
     1.65 +After functional debugging, the first run produces the visualizations seen in Figures X and X. The first thing to notice is that the first picture is slimmer than expected: of the 40 available cores, only 13 are being used. As the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite this application being used as a test program for development of the language runtime, including performance, for several months.
    1.66  
    1.67 -The second run (Fig \ref{story:b}) already corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle. Focusing on core 0, we can see that the task creation VP (short tasks with red edges outgoing) creates work in order of cores, starting with core 0. The core's scheduler operates a simple round-robin between VPs assigned to its core, so the creator VP gets switched out for the newly created work unit quickly. The work tasks take a large amount of time to complete, during which task creation is suspended.
    1.68 +\subsubsection{Second Run}
    1.69 + The second run (Fig \ref{story:b}) already corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle. 
    1.70  
    1.71 -Two solutions came to mind: distribute work to all other cores first so that they would be busy when the creator VP gets interrupted, or dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first. This gave us Fig \ref{story:c}. 
     1.72 +Zooming in on those cores, we see that the task-creation VP animates a chain of short tasks (each with an outgoing red edge), and is assigned to core 0. A task is the work in-between scheduling decisions, and creating a task requires switching to the runtime, so to animate the next work-creation task, the creator VP has to be chosen again. However, the creator VP makes the work for all the cores, starting with its own core 0, while the runtime animates tasks in the order they become ready. So, after the creator VP makes a work task for its own core, that task is ready, and the next chained creation task is put into the queue behind it. That means the work task is chosen next, and the creation task is left in the queue while the work is done, during which task creation is suspended (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tuning).
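The ordering effect just described can be mimicked with a toy FIFO ready queue. This is purely illustrative; SSR's actual runtime structures are not shown in this paper, and the task names are invented.

```c
#include <assert.h>
#include <string.h>

/* Toy FIFO ready queue (illustrative only) showing why the work task
 * runs before the next creation task on core 0: tasks are animated in
 * the order they become ready. */
#define QCAP 8
static const char *queue[QCAP];
static int head = 0, tail = 0;

static void enqueue(const char *task) { queue[tail++ % QCAP] = task; }
static const char *dequeue(void)      { return queue[head++ % QCAP]; }
```

Enqueuing `"work-for-core-0"` (made ready first by the creator VP) and then `"next-creation-task"` behind it, the FIFO hands out the long work task first, so creation sits in the queue until the work finishes.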
    1.73  
    1.74 -The section with many cores idling at the beginning has disappeared. A small idle period can still be observed between the first and the second set of work tasks, because the work tasks have roughly the same length (some of them are slightly longer because they perform a copy-transpose singleton, and small variations can be caused by cache misses etc.), so the work on core 0 holding up the creator VP, being last to be distributed, is also last to finish. It is also noticeable that in the second set of work units to be distributed, not enough work pieces remain to fill all cores, so that 16 out of 40 remain idle.
    1.75 +Two solutions come to mind: distribute work to all other cores first so that they would be busy when the creator VP gets interrupted, or dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first. 
    1.76  
    1.77 -To try to fill that space, we tried to modify the size of the work units. However, as figures \ref{story:d} and \ref{story:e} show, this did not help, as almost immediately the time spent creating the increased number of units becomes the bottleneck, and the time lost between sets grows larger than the time lost on the cores not receiving any work.
     1.78 +\subsubsection{Third Run} Distributing work to other cores first gives us Fig \ref{story:c}. The section at the top with many cores idling has disappeared. A small idle period can still be observed between the first and the second set of work tasks: the work tasks have roughly the same length (some are slightly longer because they perform a copy-transpose singleton, and small variations can be caused by cache misses, etc.), so the work on core 0 holding up the creator VP, being last to be distributed, is also last to finish. It is also noticeable that in the second set of work units to be distributed, not enough work pieces remain to fill all cores, so that 16 out of 40 remain idle.
    1.79  
    1.80 -At this point we wanted to try out if taking the road not chosen, dedicating a core to the create VP, would improve performance more.
    1.81 -Going back to version b of the code and implementing this solution instead lead to fig. \ref{story:f}. The delay between the two sets has disappeared, leading to a 4\% shorter execution time.
     1.82 +\subsubsection{Fourth and Fifth Runs} To try to fill that empty space, we tried modifying the size of the work units. However, as Figures \ref{story:d} and \ref{story:e} show, this did not help: the time spent creating the increased number of units becomes the bottleneck, and the time lost between sets grows larger than the time lost on the cores not receiving any work.
    1.83  
    1.84 -As core 0 is now empty after the creation phase at the beginning, we also moved the receive VP there (fig. \ref{story:g}). This added only a minimal improvement at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting up the work into more pieces, requiring longer collection (fig. \ref{story:h}).
     1.85 +\subsubsection{Sixth Run} At this point we wanted to try whether taking the road not chosen, dedicating a core to the creator VP, would improve performance more.
     1.86 +Going back to version b of the code and implementing this solution instead leads to Fig \ref{story:f}. The delay between the two sets has disappeared, leading to a 4\% shorter execution time.
    1.87  
    1.88 -Overall it is also noticeable that as work units become smaller, execution becomes more irregular. Variability in task length is likely due to cache misses or page faults, but for verification of this hypothesis more data would need to be collected (for instance, the "cycles stalled due to cache misses" counter available on most modern Intel chips could be tracked and displayed for each unit). 
     1.89 +\subsubsection{Seventh and Eighth Runs} As core 0 is now empty after the creation phase at the beginning, we also moved the receive VP there (Fig \ref{story:g}). This added only a minimal improvement at this size of work unit, but it allows overlapping the result collection with other work, which is an advantage when cutting the work into more pieces, requiring longer collection (Fig \ref{story:h}).
    1.90  
    1.91 -In some places ``holes'' are noticeable. Inspecting these holes closer, we can see that the stalling tasks are waiting upon the completion of a singleton. However, the operations enclosed in the singleton take only a short time, and start much later than the idle periods. Once again, the simple round-robin scheduler is at fault: When the first VP reaches the singleton code portion, it sends a request to acquire the singleton. This request succeeds, but as sending a request results in suspension of the requesting VP, a new task is scheduled. If this happens to be a long task, the singleton is suspended for a long time, but all other VPs with this singleton have to wait, because it has already been reserved. Because several VPs assigned to the same core share the same matrix pieces so as to increase cache locality, this can result in all VPs on a core being stalled, leading to the observed idle times. This is a property of the language runtime, so the application programmer cannot change this, but making the work units smaller helps minimize these effects. 
     1.92 +Overall it is also noticeable that as work units become smaller, execution becomes more irregular. Variability in task length is likely due to cache misses or page faults, but verification of this hypothesis would require collecting more data (for instance, the ``cycles stalled due to cache misses'' counter available on most modern Intel chips could be tracked and displayed for each unit). 
    1.93 +
     1.94 +\subsubsection{Holes in the Core Usage}
    1.95 +
     1.96 +In Fig X, ``holes'' are noticeable. Inspecting these holes more closely, we can see that the stalling tasks are waiting upon the completion of a singleton. However, the operations enclosed in the singleton take only a short time, and start much later than the idle periods begin. Once again, the in-order animation of queued tasks is at fault: when the first VP reaches the singleton code portion, it sends a request to acquire the singleton. This request succeeds, but as sending a request results in suspension of the requesting VP, a new task is scheduled first. If this happens to be a long task, the singleton is suspended for a long time, and all other VPs needing this singleton have to wait, because it has already been reserved. Because several VPs assigned to the same core share the same matrix pieces so as to increase cache locality, this can result in all VPs on a core being stalled, leading to the observed idle times. This is a property of the language runtime, so the application programmer cannot change it, but making the work units smaller helps minimize these effects. 
    1.97  
    1.98  %\begin{figure}[ht]
    1.99  % \begin{minipage}[b]{0.5\linewidth}
   1.100 @@ -546,18 +566,22 @@
   1.101  \section{The Model Behind the Visualization}
   1.102  \label{theory}
   1.103  
   1.104 -Now we expand on the model behind the visualizations, which is what ties the information together. It has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (just consequence graph or CG).  The UCC indicates the freedom of choice the application allows, encoding what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices. We give a precise definition of UCC then consequence graph, in turn.
     1.105 +Now that the usage has been seen, we expand on the model behind the visualizations, which is what ties the information together. Understanding the model lets one quickly see the reasons for performance-related patterns in the visualizations; such understanding generates hypotheses about the source of performance loss. 
   1.106  
   1.107 +The model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (just consequence graph or CG).  The UCC indicates the freedom of choice the application allows, encoding what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices. 
   1.108 +
     1.109 +We give a more precise description of the UCC and then of the consequence graph, in turn.
     1.110 +However, space is too limited for a complete definition, which is given in a companion paper submitted to a longer-format venue.
   1.111  \subsection{Unit \& Constraint Collection}
   1.112  The UCC contains all units of work that get scheduled and all application-related constraints on scheduling them. That's a nice solid definition but things aren't quite that simple. The complication is that different classes of application exist, with two degrees of freedom that determine how much of the UCC is actually defined in the application vs the input data, or even the scheduler.
   1.113  
   1.114  Some applications have everything determined in the code, with all units fixed, and all constraints fixed. An example is matrix multiply with fixed size matrices.  But for others, the shape of the UCC is only partially defined by the application code.  Take matrix multiply when the size is an input parameter, and the toolchain divides the work into a fixed number of pieces. Then there are a fixed number of units, but their size depends on the input parameter. So, the UCC is different for each parameter value. An extreme example is an NP complete problem, for which the units are a function of both the particular input data \emph{and} decisions made by the scheduler!
   1.115  
   1.116 -    We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC. But the amount of UCC made concrete by the application alone falls into a two-dimensional grid. One dimension covers the units, the other the constraints.
   1.117 +    We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC, as seen in this paper. But the amount of UCC made concrete by the application alone falls into a two-dimensional grid. One dimension covers the units, the other the constraints.
   1.118  
   1.119  Figure X shows the two axes and the four factors on each that indicate how much of the UCC the application code determines. The horizontal axis is for  units, vertical for constraints. The values on an axis start at 0 with units/constraints fully fixed by the application code alone, 1 means they are fixed by parameters plus code, 2 means the values of input data also play a role, and 3 means they are determined by scheduling decisions as well as (optionally) any of the rest. 
   1.120  
   1.121 -The closer the UCC is to the origin, the more concrete it is. But even the least concrete UCC, out at the end of the diagonal, becomes concrete during a run of the application. As the computation unfolds, data values interact with scheduling decisions, to fix the units and constraints among them, until all are concrete at the end of the run.
   1.122 +The closer the UCC is to the origin, the more concrete it is. But even the least concrete UCC, out at the end of the diagonal, becomes concrete during a run of the application. As the computation unfolds, data values interact with scheduling decisions, to fix the units and constraints among them, until all have been made concrete at the end of the run.
   1.123  
   1.124  Notice, though, that such a concrete UCC still has degrees of freedom. The same units could have been chosen, with the same constraints on them, but different decisions made about which hardware the units were assigned to, and which order the assignment was made in. These decisions interact with the hardware, to yield the communication patterns and consequent performance seen during the run. 
   1.125  
   1.126 @@ -567,7 +591,7 @@
   1.127  
   1.128  Applications that are far out on the diagonal only specify a portion of the structure of the UCC, but the UCC still tells what is inside the application's control vs under the scheduler's control. It thus indicates what can be done statically. The further out on the diagonal a UCC is, the less scheduling can be done statically in the toolchain.
   1.129  
   1.130 -In this paper, we do not suggest how to represent UCCs far out on the diagonal. Those actually indicate a multi-verse of UCCs. Which of them materializes  depends on the data that shows up and what the scheduler does. We only represent the concrete UCC that materializes during a run and leave the question of representing less concrete ones to future work.
   1.131 +In this paper, we do not suggest how to represent UCCs far out on the diagonal. Those actually indicate a multi-verse of concrete-UCCs. Which of them materializes  depends on the data that shows up and what the scheduler does. We only represent the concrete UCC that materializes during a run and leave the question of representing less concrete ones to future work.
   1.132    
   1.133  
   1.134  
   1.135 @@ -576,13 +600,23 @@
   1.136  
   1.137  The consequence graph integrates all three aspects of parallel performance: application, runtime, and hardware. It identifies all instances of lost performance, and links them to the cause of the loss, as was seen during the tuning story in Section X.
   1.138  
   1.139 -This graph captures the three kinds of causal relationships that affect core usage: scheduling decisions invoked by the runtime, dependencies inside the runtime itself, and hardware constraints. Each of these causalities causes some portion of the core usage that took place, and shows up as a feature on the consequence graph. 
   1.140 +This graph captures the three kinds of causal relationships that affect core usage: scheduling decisions invoked by the runtime, dependencies inside the runtime itself, and hardware constraints. Each of these  causes some portion of the core usage that took place, and shows up as a feature on the consequence graph. 
   1.141  
   1.142 -To distinguish from the UCC, the consequence graph shows the behavior resulting from scheduling decisions actually \emph{made},   not which were \emph{possible.} The UCC shows the possibilities. Hence, a consequence graph shows one of the possible sets of choices allowed by the UCC.
   1.143 +To distinguish from the UCC, the consequence graph shows the behavior resulting from scheduling decisions actually \emph{made},   from among those \emph{possible.} The UCC shows just the possibilities. Hence, a consequence graph shows \emph{one} of the possible choice-sets allowed by the UCC.
   1.144 +
     1.145 +A consequence graph accounts for each bit of core time. It has boxes that each represent one segment of core time, and arcs between the boxes that each represent a causal dependency of some kind. Each box is counted towards one work-unit, and all boxes counted to the same unit form a node.
   1.146 +
   1.147 +There are several kinds of boxes, one for each reason that the core is being used (or being forced idle), and several kinds of arcs, one for each type of causality between boxes.
   1.148 +
   1.149 +The box types are arranged by cause of core usage: application work, waiting for communication of work data, runtime internals, managing constraints, and choosing assignment of work onto cores. The runtime internals have sub-categories but space is limited so we skip those here.
   1.150 +
     1.151 +The arc types are arranged by source of the causal relationship: base-language control dependencies; a parallel semantic constraint that fed into a particular choice in the runtime (i.e., the choice ties together two specific work-units, so the completion of one causes the other to start); runtime-internal causality, such as a global lock or a distributed quorum algorithm whose action creates causal dependencies between boxes that represent execution of internal runtime code; and hardware causal relationships, such as one work-unit finishing on a core causing another work-unit to start there, given the choice by the runtime. The finer details are beyond the scope of this paper.
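The box and arc taxonomy above can be summarized as two plain enumerations. This is a sketch of the categories named in the text only, not the tool's actual data structures.

```c
#include <assert.h>

/* Sketch of the consequence-graph taxonomy from the text; the
 * identifier names are invented here. */
enum box_type {                 /* cause of core usage (or forced idleness) */
    BOX_APPLICATION_WORK,
    BOX_WAIT_FOR_WORK_DATA,     /* waiting for communication of work data  */
    BOX_RUNTIME_INTERNALS,      /* sub-categories omitted, as in the text  */
    BOX_MANAGING_CONSTRAINTS,
    BOX_CHOOSING_ASSIGNMENT     /* choosing assignment of work onto cores  */
};

enum arc_type {                 /* source of the causal relationship */
    ARC_BASE_LANGUAGE_CONTROL,  /* e.g. sequential control flow in C       */
    ARC_PARALLEL_CONSTRAINT,    /* constraint that fed a runtime choice    */
    ARC_RUNTIME_INTERNAL,       /* e.g. global lock, quorum algorithm      */
    ARC_HARDWARE                /* e.g. unit ends on a core, next starts   */
};
```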
   1.152 +
   1.153 +
   1.154  
   1.155  We will now look at each source of causal relationship.
   1.156  
   1.157 -\paragraph{Scheduling decision causality} Notice that the performance varies between choice-sets. The main reason for the variation is the communication patterns resulting from the particular choices. For a fixed concrete UCC, each set of scheduling choices it allows has a consequent pattern of core usages, which \emph{is} the performance realized. 
   1.158 +\paragraph{Scheduling decision causality} Notice that the performance varies between choice-sets. The main reason for the variation is the communication patterns resulting from the particular choices. For a fixed concrete UCC, each set of scheduling choices it allows has a consequent pattern of core usages. That pattern \emph{is} the performance realized. 
   1.159  
   1.160  The consequence graph also shows control dependencies from the base language, which may add superfluous constraints that further eliminate some otherwise allowed choices in the UCC. An example would be a \texttt{for} loop that creates work-units -- no parallelism constructs cause a sequentialization of the creations, but the base C language sequentializes it nonetheless. 
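A minimal illustration of this point, with a hypothetical creation stand-in: the C \texttt{for} loop fixes the order of creations even though no parallelism construct requires one.

```c
#include <assert.h>

/* Illustrative only: the base-language for loop imposes a creation
 * order (0, 1, 2, ...) that the parallel semantics do not require.
 * Recording the order stands in for a hypothetical create_work_unit(i). */
#define NUM_PIECES 4
static int creation_order[NUM_PIECES];

static void create_all_pieces(void) {
    int next = 0;
    for (int i = 0; i < NUM_PIECES; i++)
        creation_order[next++] = i;   /* creations happen strictly in order */
}
```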
   1.161  
   1.162 @@ -594,18 +628,9 @@
   1.163  
   1.164  These are also not normally displayed, due to clutter, and not all hardware dependencies are directly measured. Future work will focus on using performance counters and other instrumentation to add more information about the communication paths taken as a consequence of the scheduling decisions made. This takes the current linkage of application code to runtime decisions and adds the consequent communication patterns, which are the primary free variable in resulting performance, giving an end-to-end linkage between code choices and the hardware behavior they cause.
   1.165  
   1.166 -Three sources of causality: application-derived, runtime, and hardware. 
   1.167 +Each consequence-graph feature ties back to a feature in the UCC, and thence to specific segments of code or constructs.
   1.168  
   1.169 -How consequence graph features each tie back to features in UCC and to specific segments of code or constructs.
   1.170  
   1.171 -In detail, the consequence graph
   1.172 -\subsubsection{A precise definition of Consequence Graph} A consequence graph accounts for each bit of core time. It has boxes that each represent one segment of core time, and arcs between the boxes that each represent a causal dependency between boxes. Each box is counted as associated to one work-unit, and then all boxes counted to the same unit are collected into a node.
   1.173 -
   1.174 -There are several kinds of boxes, one for each reason that the core is being used (or being forced idle), and several kinds of arcs, one for each type of causal dependency.
   1.175 -
   1.176 -The box types are arranged by cause of core usage: application work, waiting for communication of work data, runtime internals, managing constraints, and choosing assignment of work onto cores. The runtime internals have sub-categories but space is limited so we skip those here.
   1.177 -
   1.178 -The arc types are arranged by source of the causal relationship: base-language control dependencies, parallel semantic constraint which fed into a particular choice in the runtime (IE, the choice ties together two specific work-units so the one completing causes other to start), runtime internal causality such as a global lock or a distributed quorum algorithm which creates causal dependencies between boxes that represent execution of internal runtime code, and arcs that represent hardware causal relationships, such as one work-unit finishing on a core causes another work-unit to start there, given the choice by the runtime. This should give a reasonably precise description of the arcs, while the finer details are beyond the scope of this paper.
   1.179  
   1.180  \subsection{Levels of UCC and Consequence Graph}
   1.181  There is one last twist to the story of UCCs and consequence graphs: they come in levels that correspond to the levels of scheduler in a hierarchical machine. We use an example involving a server machine with a hierarchy of runtimes to illustrate both, concentrating first on just the UCCs, then adding the consequence graph. 
   1.182 @@ -630,15 +655,15 @@
   1.183  
   1.184  A consequence graph ties together scheduling decisions made on units with the consequences in the hardware of those decisions. The goal is to charge each segment of time on a physical core to exactly one box in a consequence graph.
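The goal of charging each segment of core time to exactly one box can be stated as a tiling invariant. Below is a minimal sketch in C, assuming the boxes for a core are kept sorted by start time; the function name is ours, not the tool's:

```c
#include <assert.h>

typedef struct { double start, end; } Segment;

/* Hypothetical check of the charging invariant for one core: the boxes
 * charged to that core must tile its timeline [t0, t1] with no gaps and
 * no double-charging, so every segment of core time lands in exactly
 * one box.  Boxes are assumed sorted by start time. */
static int charged_exactly_once(const Segment *boxes, int n,
                                double t0, double t1) {
    double t = t0;
    for (int i = 0; i < n; i++) {
        if (boxes[i].start != t)
            return 0;  /* gap (start > t) or double-charge (start < t) */
        t = boxes[i].end;
    }
    return t == t1;    /* timeline fully covered */
}
```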
   1.185  
   1.186 -In the UCCs, for a higher up runtime, each lower runtime it schedules onto is treated as a machine. We saw in Fig X that a unit has and entire  UCC in the level below, so there is a corresponding consequence graph. The UCC states the degrees of scheduling freedom, while the consequence graph shows the hardware consequences resulting from the particular scheduling choices made in the runtime.
   1.187 +In the UCCs, for a higher-up runtime, each lower runtime it schedules onto is treated as a machine. We saw in Fig X that a unit has an entire UCC in the level below, so there is a corresponding consequence graph. The UCC states the degrees of scheduling freedom, while the consequence graph shows the hardware consequences resulting from the particular scheduling choices made in the runtime.
   1.188  
   1.189  Now the question is: what hardware usage should be counted towards one of the units? The answer: only the core time used to schedule that unit, do its work, and wait for its non-overlapped communication.
   1.190  
   1.191  The time spent on scheduling one of the units is straightforward: it is the normal runtime overhead of receiving a unit, managing the constraints on it, and choosing the best location and time to execute it. The only variation is that the location chosen is a lower-level runtime rather than a physical core.
   1.192  
   1.193 -But what core time should be charged as the work of that unit? The answer is: the core time not accounted for in the descendent consequence graphs. Each segment of physical core time can only be charged to one box in one consequence graph, so only the leaf graphs count the actual work. Further, the time spent in the lower runtime spent receiving, handling constraints, and choosing when and where to schedule the sub-units is charged to boxes in the lower-level consequence graph.  By the process of elimination, the only time not accounted for elsewhere is the time spent dividing up a unit into smaller ones, and time spent accumulating the individual results back together. So this is what gets charged to the work-time box for a unit.
   1.193 +But what core time should be charged as the work of that unit? The answer is: the core time not accounted for in the descendant consequence graphs. Each segment of physical core time can only be charged to one box in one consequence graph, so only the leaf graphs count the actual work. Further, the time the lower runtime spends receiving sub-units, handling constraints, and choosing when and where to schedule them is charged to boxes in the lower-level consequence graph. By process of elimination, the only time not accounted for elsewhere is the time spent dividing a unit into smaller ones, and the time spent accumulating the individual results back together. So this is what gets charged to the work-time box for a higher-level unit.
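The by-elimination accounting can be sketched in a few lines of C. The names here are hypothetical; the point is only that a higher-level unit's work-time box receives whatever core time its descendant graphs did not already claim:

```c
#include <assert.h>

static double sum_durations(const double *d, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += d[i];
    return s;
}

/* Charge by elimination: take the unit's total core time and subtract
 * everything already charged to boxes in its descendant consequence
 * graphs (the lower runtime's scheduling time plus the leaf units'
 * actual work).  What remains is the divide-into-sub-units and
 * accumulate-results time, charged to the higher-level work-time box. */
static double work_box_charge(double total_core_time,
                              const double *descendant_boxes, int n) {
    return total_core_time - sum_durations(descendant_boxes, n);
}
```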
   1.195  
   1.196 -The last question is how to handle communication consequences. This is tricky because decisions in higher-level runtimes set the context for decisions in lower-level ones. This means a higher-level choice is linked to the consequences from lower-level choices. The value of a consequence graph is linking the size of boxes in it to the decisions made by the scheduler, as represented by the shape. It's not clear how to divide, among the levels, the time that cores spend waiting for non-overlapped communication. We have no good answer at the moment and leave it for future work.
   1.197 +The last question is how to handle communication consequences. This is tricky because decisions in higher-level runtimes set the context for decisions in lower-level ones. This means a higher-level choice is linked to the consequences of lower-level choices. The value of a consequence graph comes from linking the size of the boxes in it to the decisions made by the scheduler, which are represented by its shape. It is not clear how to divide, among the levels, the time that cores spend waiting for non-overlapped communication. We have no good answer at the moment and leave it for future work.
   1.198  
   1.199  \section{Implementation}
   1.200  %%how are graphs generated, what is needed
   1.201 @@ -662,7 +687,7 @@
   1.202  
   1.203  \section{Conclusion}
   1.204  \label{conclusion}
   1.205 -We have shown how to apply a computation model to instrument a language runtime for collecting measurements that connect: to each other, to application structure, to scheduling decisions, and to hardware. A simple visualization of the data has features that indicate lost performance, and features that visually link the lost performance to the cause, no matter if the cause is application structure, language runtime implementation, or hardware feature.  It is this linkage, due to the computation model, that sets this approach apart from others. 
   1.206 +We have shown how to apply a computation model to instrument a language runtime for collecting measurements that connect to each other, to application structure, to scheduling decisions, and to hardware. A simple visualization of the data has features that indicate lost performance, and features that visually link the lost performance to its cause, whether that cause is application structure, language-runtime implementation, or a hardware feature. It is this linkage, due to the computation model, that sets this approach apart from others. 
   1.207   
   1.208  
   1.209  \bibliography{BibForPapers}