Category Archives: Appendix- Program Optimization

Code Generation With Unfixed Variable Locations

Sorry for the long break.  I’m hoping to use this summer to get most of the rest of the way through this book.

The Problem: Code Generation With Unfixed Variable Locations.  This is problem PO8 in the appendix.

The Description: Given a program (a sequence of instructions I1..In), a set V of variables, a “short address span” B, and a desired program size M.  We want to assign addresses to both the instructions of the program and the variables.  If an instruction accesses a variable that is stored within B addresses of that instruction, we can represent that instruction in one byte (using the “short address”).  Otherwise, the instruction is represented in two bytes (using the “long address”).  Can we place all instructions and variables in M addresses or less?

Example: Suppose I have the following instructions (A, B, and C are variables, “ref” just means the instruction refers to that variable)

ref A
ref A
ref B
ref C

If B=1 (so all short references have to be within one instruction of the variable address), we can place all instructions and variables in 7 addresses, for example:

ref A
A
ref A
ref B
B
C
ref C

If instead that second A reference was at the end of the sequence of instructions:

ref A
ref B
ref C
ref A

..then (because we can’t change the order of the instructions in the program), we would have to make one of the references to A long, for example:

ref A
A
ref B 
B
C
ref C
ref A (long)

That second instruction that references A is a long reference, so this program needs 8 bytes of address space.
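
To make the cost model concrete, here is a small Python sketch (mine, not from Robertson’s paper) that evaluates one fixed layout.  It measures distances in addresses, which is how I read the examples above, and it keeps lengthening references until the short/long choices are consistent.  The data format and helper name are my own.

# Entries are either ("ref", v) -- an instruction that refers to variable v --
# or ("var", v) -- the cell holding variable v.  Short references take 1
# address, long references take 2, and variables take 1.  Whether a reference
# can be short depends on the final addresses, so we start optimistic and
# lengthen references until nothing changes.
def layout_size(layout, B):
    is_long = [False] * len(layout)          # only meaningful for "ref" entries
    while True:
        addr, var_addr, ref_addr = 0, {}, {}
        for i, (kind, v) in enumerate(layout):
            if kind == "var":
                var_addr[v] = addr
                addr += 1
            else:
                ref_addr[i] = addr
                addr += 2 if is_long[i] else 1
        changed = False
        for i, (kind, v) in enumerate(layout):
            if kind == "ref" and not is_long[i]:
                if abs(ref_addr[i] - var_addr[v]) > B:
                    is_long[i] = True        # out of range: must use a long address
                    changed = True
        if not changed:
            return addr

good = [("ref", "A"), ("var", "A"), ("ref", "A"),
        ("ref", "B"), ("var", "B"), ("var", "C"), ("ref", "C")]
bad = [("ref", "A"), ("var", "A"), ("ref", "B"), ("var", "B"),
       ("var", "C"), ("ref", "C"), ("ref", "A")]
print(layout_size(good, B=1))   # 7
print(layout_size(bad, B=1))    # 8: the final reference to A has to be long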

Reduction: The paper by Robertson uses 3SAT, so let’s start with a 3SAT instance with n variables and m clauses, where each clause Cj contains literals Aj1, Aj2, and Aj3.  (Having just 1 or 2 literals in a clause is ok too.)

Robertson notes that for any two literals V and W, if you have V and ~W appear as one (complete, two-element) clause, and also have ~V and W appear as a different clause, then V and W must both have the same truth value in order to satisfy both clauses.  This gets us a way to add more variables to the formula that have truth values the same as some other variable.  Specifically, if some variable V shows up in our formula more than 5 times, we can create a new variable W to replace three of those occurrences of V, and add two clauses to the end (the clauses (V ∨ ~W) and (~V ∨ W)).  This forces our new W to have the same truth value V will have if the formula is satisfied, and reduces the number of occurrences of V.  We repeat this process until no variable appears more than 5 times in the formula.  So, now we can say that for some variable Vi, it appears in clauses Ci1 through Ci5 (or maybe less than 5 if the variable occurs less than 5 times).
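
Here is a sketch of that occurrence-capping step.  It is not Robertson’s exact pairwise trick: instead of adding the (V ∨ ~W) and (~V ∨ W) clause pairs a few at a time, it uses the common variant where a heavily-used variable gets one fresh copy per occurrence, tied together by a cycle of implication clauses that forces all the copies to share one truth value.  The clause format and the copy names are mine.

# Cap every variable at `limit` occurrences.  A formula is a list of clauses;
# a clause is a list of literals like "x3" or "~x3".
def cap_occurrences(clauses, limit=5):
    clauses = [list(c) for c in clauses]
    spots = {}                                   # variable -> list of (clause, position)
    for ci, clause in enumerate(clauses):
        for li, lit in enumerate(clause):
            spots.setdefault(lit.lstrip("~"), []).append((ci, li))
    for var, occurrences in spots.items():
        if len(occurrences) <= limit:
            continue
        copies = [f"{var}__{k}" for k in range(len(occurrences))]   # fresh names (mine)
        for copy, (ci, li) in zip(copies, occurrences):
            neg = clauses[ci][li].startswith("~")
            clauses[ci][li] = ("~" + copy) if neg else copy
        for k in range(len(copies)):             # copy_k -> copy_(k+1), wrapping around
            clauses.append(["~" + copies[k], copies[(k + 1) % len(copies)]])
    return clauses

Each copy ends up appearing 3 times (once where it replaced an original occurrence, and twice in the implication cycle), so nothing is over the limit, and the new formula is satisfiable exactly when the old one was.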

The program we will build will consist of “modules” that can be moved independently.  Each variable will have two modules (“t” and “f” modules), and each clause will have one.  Here is a picture of a variable “t” module:

“l” is the paper’s version of B- the maximum distance we can put a short reference to a variable.

What we’re seeing here is:

  • A “slot” we can put a variable (or more than one) into
  • 16 instructions that refer to a new variable ui
  • 3 instructions that reference each of the variables corresponding to the (at most 5) literals and clauses the variable can appear in.

The “f” module is similar except that the definition of τr is negated (it’s the positive variable if ~Vi occurs as a literal).  The new variable ui appears in the t and f modules for variable i and nowhere else, so the only way references to it can be short is if it is placed into the “slot” in either the “t” module (meaning variable i is set to true) or the “f” module (meaning variable i is set to false).  If we put ui in the slot of (say) the “f” module, then we have enough room to also place the variables corresponding to the literals in the same block, making everything short.  But by putting the ui variable in the “f” module, all of the references to it from the “t” module will be long (adding 16 extra bytes of address space), and all of the literals will be too far away from the slot for the slot to be useful in that module.

We also have a module for the clause (the “c”-module):

Notice that the versions of the literals that actually appear in the clause are to the right of the slot.  These α variables are the same ones we had in the variable modules (so each α variable shows up once in a “c” module, but 3 times each in a “t” or “f” module).  Since there are more references to the α variables in our variable modules, the optimal placement should be to place those variables in the slot of the “t” or “f” modules (not in the “c” module).  The “remaining” α variables (the negations of the ones we put into the slot in either the “t” or “f” modules and which it didn’t help us to put in the slot of the other module) can go into the slot here and make one of the corresponding β references short.

We do have that extra γj cell to play around with, and here is where either the paper has a mistake or I lose the ability to follow.  What is supposed to happen is that if we fill the slot with a variable that is referred to by one of the β variables, then that reference will be short, and the reference from γ to δ will be within the range of a short reference.  That’s fine, but the paper wants this to be the α variable that we did not place in the “t” or “f” module.  But if placing the α in the “t” module meant setting the variable to true, then what is not in the module is the negation of that literal, which feels like we are doing things backward: if we want to set the variable to true, we should have put ui in the “f” module so that the literal that can be placed in our clause module corresponds to the true version of the literal.

I’m not sure it matters much to the reduction.  If I’m right, it just means we should have labeled the α variables the negations of what they really are.  But it does worry me.

The proof finishes by deciding values of B and M to make it all work.  B ends up being 31 (it comes from the distance we need to separate the 16 ui instructions from the τ instructions), and M ends up being 16n plus 4 times the total number of literals in the formula.

Difficulty: 7.  This has a lot of pieces, but (except for that end part) I think it’s pretty feasible to see how all of the pieces fit together.

 

 

Code Generation With Address Expressions

For this problem, I think it is easier to see what is happening if we use a problem definition that uses actual instructions (instead of the more abstract instructions G&J use in their definition).  So I am taking this problem description from the paper by Szymanski that has the reduction.

The problem: Code Generation With Address Expressions.  This is problem PO7 in the appendix.

The description: The PDP-11 has the following instructions:

  • Unconditional Branch (“br”), is two bytes long, and can only be used if the branch target is within 256 bytes or so of the branch instruction (technically 256 larger or 254 smaller).
  • Unconditional Jump (“jmp”), which works similarly to “br”, but is 4 bytes long and has no limit to the distance to the target.
  • Conditional branches (“beq” and “bne”), which like “br” are two bytes long and require the target to be within a fixed distance of the instruction
  • Extended conditional branches (“jbr”, “jeq”, and “jne”).  These instructions are either translated into “br”, “beq”, and “bne” if the targets are close enough, or they are treated as jmp instructions or conditional branches to a jmp instruction if the destination is too far.

Given a program with a series of these extended conditional branch instructions, and a goal size s, can we implement the program (by replacing the extended conditional statements with branches and jumps as appropriate) in s bytes or less?

Example: First, let’s look at how these jumps can happen.

Suppose I have:

    jbr  L1
    (some number of bytes of instructions without
     branches or jumps)
L1:

.. if the number of bytes in between is small, I can replace that with:

    br L1
    (the instructions just like before)
L1:

But if the size of the intermediate instructions is too large to allow a branch, we need to do:

     jmp L1
     (the instructions just like before)
L1:

Since a jmp is 4 bytes long and a br is 2 bytes long, the second option leads to a larger number of bytes at the end.

Here’s another example from the paper:

X: jne Y
   252 bytes of instructions
   jbr X
Y:

If we make the jbr instruction into a jump, then there are 256 bytes (plus the length of the jne instruction) between the jne instruction and its target Y, so the jne will also have to be a jump.  But if we make the jbr instruction into a br, then there are 254 bytes of instructions in between X and Y, so the jne can become a branch as well (so the distance between X and Y is 256 bytes), meaning that the jbr was correctly used as a branch (if the jne were a jump, the jump back to X would have covered too much space).
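
Here is a little checker sketch (mine, not Szymanski’s) that takes a straight-line program and a br/jmp choice for every extended branch, lays out byte addresses, and reports the total size, or None if some br can’t reach its target.  It uses the ranges from the post (up to 256 bytes forward or 254 bytes back, measured from the start of the instruction), and it treats the jne in the example the same as a jbr for sizing purposes; real PDP-11 offsets are PC-relative, so treat the exact bounds as an approximation.

# items: a list of ("code", nbytes), ("jbr", label), or ("label", name).
# choices: one of "br"/"jmp" per jbr, in order of appearance.
def program_size(items, choices):
    sizes = {"br": 2, "jmp": 4}
    picks = iter(choices)
    addr, labels, branches = 0, {}, []
    for kind, arg in items:
        if kind == "label":
            labels[arg] = addr
        elif kind == "code":
            addr += arg
        else:                                  # a jbr: its size depends on the choice
            choice = next(picks)
            branches.append((addr, arg, choice))
            addr += sizes[choice]
    for at, target, choice in branches:
        offset = labels[target] - at
        if choice == "br" and not (-254 <= offset <= 256):
            return None                        # this br cannot reach its target
    return addr

# The post's second example: X: jne Y / 252 bytes / jbr X / Y:
example = [("label", "X"), ("jbr", "Y"),       # the jne, treated as a jbr for sizing
           ("code", 252), ("jbr", "X"), ("label", "Y")]
print(program_size(example, ["br", "br"]))     # 256: both fit as branches
print(program_size(example, ["br", "jmp"]))    # None: the jmp pushes Y out of the jne's range
print(program_size(example, ["jmp", "jmp"]))   # 260: both long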

Reduction: Szymanski uses 3SAT.  So we start with a 3SAT instance with n variables and m clauses.  For each variable xi, we add the following lines to our program:

Yi:  jbr
     254 bytes of non-branching or jumping code
Zi:

(I’m a little confused by this because he doesn’t list a destination for the jbr here.  I think we’re supposed to go to Zi, so that we can potentially replace the jbr with a br)

For every clause Cj = {lj,1, lj,2, lj,3} we add the following lines to our program:

Aj:  
    246 bytes of non-branching or jumping code
    jbr Tj1
    jbr Tj2
    jbr Tj3
Bj:

..where if lj,k is an un-negated version of variable xi, Tjk is an address calculated as the current address +Zi-Yi. If lj,k is a negated version of variable xi, Tjk is calculated as the current address +Zi-Yi-512.

We also add to the end of every clause n+3m+1 copies of a jbr statement to a line Bj-Aj bytes away from the current line.

The total length of this program is 256n + 254m + 2nm + 6m² + 2q, where q is the number of times we need to use a long jump instruction.  We have n + 3m + m(n+3m+1) jbr instructions to deal with.  Our goal size s is 258n + 280m + 2nm + 6m², which we can only reach if we have at most n+3m jmp instructions (the rest being replaced by br’s).

Each of the Yi jumps can either be replaced by a “br” instruction (which we will interpret as setting variable i to true), making the distance between Yi and Zi 256 bytes, or by a jmp instruction (which we will interpret as setting variable i to false), making the distance between Yi and Zi 258 bytes.

The distance between each Aj and Bj in the clause pieces is 246 bytes, plus 2 bytes for each br we use or 4 bytes for each jmp we use.  So the total distance is 252, 254, 256, or 258 bytes.  Notice that we can only do a short translation if the corresponding literal is true, meaning that there are 256 or fewer bytes in the clause fragment if and only if the clause can be satisfied.

So, if the original formula is satisfiable, we will be able to find a way to satisfy each clause j to make the distance between the Aj and Bj labels 256 bytes or less.  So all of the “extra” jbr instructions at the end of each clause can be given short translations, so we will have our total program be of size s or less.

If the original formula is unsatisfiable, then no matter what we decide to do with our variables, there will be at least one clause where we have to give the long (258 byte) distance between Aj and Bj (this is the clause that is unsatisfied by whatever variable assignment we choose).  This means that the extra jbr Bj-Aj instructions at the end of that clause must turn into jmp instructions that are 4 bytes long each, making the overall size of the program too long.

Difficulty: 7.  It takes a while to understand the distances, and one of the consequences of using the real PDP-11 machine is the difference in distance between forward jumps that can be turned into branches (256 bytes) and backward ones (252 bytes), which makes the arithmetic and algebra a bit harder than I’d like.  But I think this is a neat reduction.

Code Generation for Parallel Assignments

The problem: Code Generation for Parallel Assignments.  This is problem PO6 in the appendix.

The description: Given a set of variables V= {v1 ..vn}, a set of assignments A= {A1..An}, where each assignment sets some variable vi to the result of some operation on a subset of parameters of V, and an integer K.

Can we order the assignments (equivalently, the elements of V) in such a way that at most K variables are referenced by an assignment that comes later in the ordering than the assignment to that variable?

Example: Sethi’s paper, which describes this problem, defines a parallel assignment as a statement like this:

i,x <- i+1, x+i

The idea is that both i and x get set to their values in the same assignment.

Sethi also defines a realization of an instruction, which is the ordering in which we do the assignments.  One realization of the above instruction is to compute i before x, and the other is to do x before i.

The cost of that realization is the number of variables that cause references to future instructions in the realization.  So, if we compute i before x, that realization is cost 1, because the x+i instruction has to wait for the change to i to happen  (and we have to keep the “new” value of i around someplace to use in the x instruction).

Computing x first and i second has a cost of 0, because we don’t need to keep the value of x around for the second instruction.  Our job is to minimize this cost (or, technically, to see if the cost is <= K).
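
Here is how I would compute the cost of a realization, under my reading of the definition: a variable counts toward the cost if some assignment later in the realization still references it, so its value has to be kept around past its own assignment.  The data format is mine.

# assignments: dict target -> set of variables its right-hand side uses.
# order: the realization, i.e. the order in which we perform the assignments.
def realization_cost(assignments, order):
    position = {var: i for i, var in enumerate(order)}
    costly = set()
    for target, uses in assignments.items():
        for v in uses:
            # v is referenced by target's assignment, which runs after v's own
            if v in position and position[target] > position[v]:
                costly.add(v)
    return len(costly)

example = {"i": {"i"}, "x": {"x", "i"}}        # the post's i, x <- i+1, x+i
print(realization_cost(example, ["i", "x"]))   # 1: x+i runs after i is reassigned
print(realization_cost(example, ["x", "i"]))   # 0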

What’s interesting to me, but not really discussed in the paper, is the fact that different realizations have different semantic meanings.  If we compute the i+1 first, x gets the updated value of i, but if we compute it second, then x gets the old value of i.  Maybe this kind of potential confusion is why we don’t typically see this kind of assignment operation in the present day.

Reduction: Sethi uses Feedback Node Set. So we start with a directed graph G.  We will map all of the vertices in the graph to variables.  For each vertex vi in the graph, if vi has edges connecting to w1..wk, we say that vi is computed from those vertices (the w1..wk are the parameters to create vi).  We will use the same K from the Feedback Node Set problem.

Suppose our graph has a set U of feedback vertices.  Sethi provides a separate proof about feedback vertex sets: a graph G has a feedback set U if and only if we can order the vertices of the graph as y1..yn such that whenever there is a “backward” edge from yj to yi with j > i, yj is in U.

So, we use that ordering as our realization for our assignments. These “backward” edges correspond to “backward” references in our realization (which contribute to the cost).  Since each of these back edges must start with a vertex in U, and each one of these vertices costs 1 to our realization cost, this means that our realization cost is at most the size of U, which we know is K.

Difficulty: 6.  The sequence proof is a bit challenging to follow but other than that the hard part of this is understanding the problem.  One thing that tripped me up was that the cost of the realization was the number of variables that had backward references, not the number of references themselves.  (So, a variable that has 2 references to 2 different future instructions counts just once)

Code Generation With Unlimited Registers

The reason I found the easier reduction for the previous problem is that it was in the paper that discussed this similar problem.

The problem: Code Generation With Unlimited Registers.  This is problem PO5 in the appendix

The description: Given a directed acyclic graph G=(V, A), where no vertex has an out-degree larger than 2.  Just like the previous problem, the leaves of this graph (with out-degree 0) are our starting values, and the roots of the graph (with in-degree 0) are the final values we need to compute.

The operations that are allowed are:

  • ri <- rj (copy a value between registers)
  • ri <- ri op rj  (replace a value in ri with the combination of ri and rj.  This is how we “compute” nodes in the graph with 2 children)

(The problem definition in G&J makes a big deal about labeling the (at most) 2 edges leaving a vertex v as “Left” and “Right”, and so the ri comes from the “Left” side, and the rj comes from the “Right” side.  But G&J also say (as a “private communication”) that this problem remains NP-Complete if we allow commutative instructions- in other words, ri <- rj op ri is also allowed.  So I don’t think it matters.)

Can we compute the root vertices of G in K or fewer instructions, if we allow any number of values to be stored in registers?

Example: Here’s the complicated graph from last time:

We can start by loading the 1,2,3, and 4 into registers 1-4.  Then we can do:

  • R1 <- R1 op R2 (compute a)
  • R2 <- R2 op R4 (compute d)
  • R3 <- R3 op R4 (compute c)
  • R2 <- R2 op R3 (compute f)
  • R1 <- R1 op R2 (compute g)
  • R5 <- 2 (another copy)
  • R6 <- 3 (another copy)
  • R5 <- R5 op R6 (computing b)
  • R3 <- R3 op  R5 (computing e)
  • R1 <- R1 op R3 (computing h)

The reason why this is hard, even with as many registers as we want, is because the operations are “accumulator” operations- they always put the result of the operation back into one of the operand registers, killing the old value.  So if we want to use that value more than once, we either need to reload it or be very careful about the order in which we compute vertices.

If we allowed 3-register operations like rk = ri op rj, then this is easily solvable by computing values “up the graph” in basically any bottom-up order you want, putting each vertex’s value into a new register.
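
Here is a quick sketch of that easy case: walk the DAG in any bottom-up (topological) order and give every vertex a brand-new register.  The graph format and names are mine.

from graphlib import TopologicalSorter

# children[v] lists the (at most 2) operands of v; leaves map to [].
def three_address_code(children):
    order = TopologicalSorter(children).static_order()   # operands before parents
    reg_of, program = {}, []
    for v in order:
        reg_of[v] = f"r{len(reg_of) + 1}"                # a brand-new register for v
        kids = children.get(v, [])
        if not kids:
            program.append(f"{reg_of[v]} <- load {v}")
        else:
            program.append(f"{reg_of[v]} <- " + " op ".join(reg_of[c] for c in kids))
    return program

# a tiny made-up example (not the post's graph): root = (1 op 2) op (2 op 3)
dag = {"1": [], "2": [], "3": [], "x": ["1", "2"], "y": ["2", "3"], "root": ["x", "y"]}
print("\n".join(three_address_code(dag)))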

Reduction: The paper by Aho, Johnson, and Ullman that we used last time also talks about this variant of the problem.  They say that it’s basically the same situation: you just need to load the leaf values into registers right away, and you get a situation that is “similar to” the one-register machine.  I think that the basic structure of the chains we are building still holds in this new situation.

Difficulty: The one-register problem is a 6 or a 7, so this is probably a 5 or 6 once you have that.  If you are going to make the students walk through the whole reduction step by step in this new situation, it might be a bit harder.

Code Generation on a One-Register Machine

I’ve done a lot of traveling over the past month, so I’m sorry about missing posts.  Things should be back to normal now.

The problem: Code Generation on a One-Register Machine.  This is problem PO4 in the appendix.

The description: Given a directed acyclic graph G = (V,A), in which no vertex has an out-degree larger than 2, and a positive integer K.  The leaves of this graph (the vertices with out-degree 0) are our starting values, sitting in memory.  Can we compute the values in all root vertices  (with in-degree 0) in K instructions or less, if our only instructions are:

  • Load a value from memory into our only register.
  • Store a value from the register into memory
  • Doing an operation combining a value in a register and a value in memory.  The operation must connect two children of a vertex in a graph together, and the “result” is the parent vertex.  The result value replaces the original value in the register.

Example:  Here’s a simple graph:

Here, we can compute the “+” node by:

  • Loading 1 into the register.
  • Doing the + operation between the register and 2
  • Storing the + operation to memory (G&J’s definition of the problem says that the node is not computed until the value is stored)

We can compute the “-” node in another 3 instructions, and since the value of the “-” node is still in the register, compute the “*” node in 1 more instruction, and store it with our last instruction.

Here’s a more complicated graph:

To do this one, we will have to load the value in node 2 lots of times.  For example, here is the set of instructions I came up with to compute h:

  • load 1
  • op to create a (only 1 operand)
  • store a
  • load 4
  • op to create c (3 is in memory)
  • store c
  • load 2
  • op to create d (4 is in memory)
  • op to create f (c is in memory)
  • op to create g (a is in memory)
  • store g
  • load 2
  • op to create b (3 is in memory)
  •  op to create e (c is in memory)
  • op to create h (g is in memory)
  • store h

It’s possible that we can do this in fewer instructions, but hopefully, you can see why this problem is hard- knowing what value to keep in the register is tricky.
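
Here is a little simulator sketch (mine) for this cost model: memory starts out holding the leaves, an “op” combines the register with values in memory to produce a named vertex, and storing puts the register’s value back in memory.  It checks that an instruction sequence is legal for a given DAG and returns the instruction count; the graph below is my reconstruction of the post’s example, read off the instruction list above.

# children: dict vertex -> list of operand vertices (leaves map to []).
def run_one_register(children, program):
    memory = {v for v, kids in children.items() if not kids}   # leaves start in memory
    register = None
    for kind, v in program:
        if kind == "load":
            assert v in memory, f"cannot load {v}: not in memory"
            register = v
        elif kind == "store":
            assert register == v, f"cannot store {v}: register holds {register}"
            memory.add(v)
        else:                                   # ("op", v): compute vertex v
            kids = children[v]
            assert register in kids, f"{v}: register holds {register}, not an operand"
            assert all(k in memory for k in kids if k != register), f"{v}: operand not in memory"
            register = v                        # the result replaces the register value
    return len(program)

# my reconstruction of the post's graph, read off the instruction list above
graph = {"1": [], "2": [], "3": [], "4": [],
         "a": ["1"], "b": ["2", "3"], "c": ["4", "3"], "d": ["2", "4"],
         "e": ["b", "c"], "f": ["d", "c"], "g": ["f", "a"], "h": ["e", "g"]}
prog = [("load", "1"), ("op", "a"), ("store", "a"),
        ("load", "4"), ("op", "c"), ("store", "c"),
        ("load", "2"), ("op", "d"), ("op", "f"), ("op", "g"), ("store", "g"),
        ("load", "2"), ("op", "b"), ("op", "e"), ("op", "h"), ("store", "h")]
print(run_one_register(graph, prog))   # 16 instructions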

Reduction: G&J point out that the reduction is from a paper by Bruno and Sethi, which uses 3SAT to do the reduction.  The instance they build is pretty complicated.  I also came across a paper by Aho, Johnson, and Ullman, who extend the result of the Bruno and Sethi paper with a nice reduction from Feedback Vertex Set.  I think this reduction is easier to follow, so we’ll go with that.

So, we are given an instance of FVS: a directed graph G and an integer K.  We are looking for a set F of K vertices such that every cycle in G goes through some element of F.

We are going to build our Code Generation graph D as follows:

  • For every vertex in G with outdegree d, build a “left chain” of d+1 vertices.  So if vertex a has 2 edges leaving it, we will create 3 vertices a0, a1, and a2.  a2 will connect to a1, and a1 will connect to a0.
  • Each of the “0” vertices at the bottom of these chains connects to 2 distinct memory values (they will be the leaves of the code graph).
  • If vertex v has outdegree d, each of the d non-bottom vertices in v’s chain also connects to the “0” vertex of a different neighbor of v in G.
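
Here is a sketch of how I read the construction of D from G.  The leaf names, and the choice of which chain vertex connects to which successor, are my own conventions; the example graph is just my guess at the paper’s example, based on the chain lengths mentioned below.

# G: dict vertex -> list of successors (the directed graph we want an FVS for).
def build_code_graph(G):
    D = {}
    for v, succs in G.items():
        D[f"{v}_leafA"], D[f"{v}_leafB"] = [], []
        D[f"{v}0"] = [f"{v}_leafA", f"{v}_leafB"]        # two private memory leaves
        for i in range(1, len(succs) + 1):
            # chain vertex v_i is computed from v_(i-1) and from the bottom of
            # the chain of v's i-th successor
            D[f"{v}{i}"] = [f"{v}{i-1}", f"{succs[i-1]}0"]
    return D

# my guess at the example graph, based on the chain lengths mentioned below
G = {"a": ["b"], "b": ["c", "d"], "c": ["d"], "d": ["a"]}
for vertex, operands in sorted(build_code_graph(G).items()):
    print(vertex, "<-", operands)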

Here is an example from the paper:

Notice that if we don’t have the edges between the chains, we can compute the entire chain with just 2 loads (of the leaves that start in memory).  So, the only loads needed to compute all of D happen in the leaves, or in some of the “level 1” vertices that are parents of the leaves.   If we have to re-load one of those vertices, it is because there is no optimal strategy to avoid loading it, which means it’s part of a cycle.

For example, look at the a and b chains in the picture above.  If we didn’t have any of the c or d vertices or edges in our graph, we could compute a1 and b1 without loading any vertex that is not a leaf: compute b0, b1, b2, then a0, then a1 (which uses a0 from the register and b0 from memory).  The reason we can do this is that while a1 depends on b0, none of the b vertices depend on anything in a, which gives us a chain to do first.  We need to reload a value when we have a circular dependency between chains (so there is no correct chain to do first).  That’s the relationship between the chains and the feedback vertex set.

This works in the other direction as well- if we are given the feedback vertex set in G, we can compute those vertices first in D, and then load them all once as needed to compute D.

The paper says that in the example graph, the node {d} by itself is a Feedback Vertex Set, and the optimal computation ordering is: d0,c0, c1, b0,b1, b2, a0,a1, d1.  That final d1 needs a re-load of d0.  The 1 extra load corresponds to the 1 vertex in our Feedback Set.

Difficulty: 6.  Maybe 7.  I think this is right at the limit of what a student can figure out, but I would also want a more rigorous proof about the connection between the extra loads and the feedback set, which is probably tricky to come up with.

Register Sufficiency For Loops

We finally get to the book problem that was the basis for our diversion.

The problem: Register Sufficiency For Loops.   This is problem PO3 in the appendix.

The description: Given a set of variables to be used within a loop of N instructions, and a set of K registers.  Each variable has a “lifetime”, defined by the instruction in the loop where we start needing its value and the number of instructions we need that value for.  We can assign two variables to the same register if they do not overlap in their lifetimes (keeping in mind that “overlap” includes wrapping around to the start of the next loop iteration).  Can we assign variables to registers such that there is no overlap?

Example: I think it’s easier to understand the concept of a “lifetime” of a variable by showing an example of a loop:

B
C
B
C
A
C
A
C
A
C
B
C
B

In this loop, we will need one register to hold C (because we’re basically using it the whole time), but the other register can alternate between holding A and B.  It holds A in the middle of the loop, but it can hold B from the end of one iteration through the start of the next.

If I added an extra B in the middle of the A instructions, I would need 3 registers (because we can’t have 1 register holding A and B because the “middle” use of B would conflict with the need to hold A).
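
Here is a checker sketch (mine): lifetimes are arcs on a circle of N instruction slots, given as (start, length) and allowed to wrap around the end of the loop, and two variables may share a register only if their arcs never cover the same slot.  The lifetimes below are my reading of the 13-instruction loop above.

# lifetimes: dict var -> (start, length); register_of: dict var -> register name.
def registers_ok(N, lifetimes, register_of):
    def slots(var):
        start, length = lifetimes[var]
        return {(start + i) % N for i in range(length)}
    by_register = {}
    for var, reg in register_of.items():
        by_register.setdefault(reg, []).append(var)
    for reg, group in by_register.items():
        for i, u in enumerate(group):
            for v in group[i + 1:]:
                if slots(u) & slots(v):
                    return False            # u and v are both live in some slot
    return True

# slots 0-12: A is live from its first use (slot 4) to its last (slot 8),
# B wraps around from slot 10 to slot 2 of the next iteration, C runs from 1 to 11
lifetimes = {"A": (4, 5), "B": (10, 6), "C": (1, 11)}
print(registers_ok(13, lifetimes, {"A": "r2", "B": "r2", "C": "r1"}))  # True
print(registers_ok(13, lifetimes, {"A": "r1", "B": "r1", "C": "r1"}))  # False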

Reduction: While this is the main result for us, this problem is mentioned as an aside in the Garey, Johnson, Miller, and Papadimitriou paper we have been working with for the past few weeks.

The idea is that this problem is very similar to Arc Coloring.  They show the problems are similar, which is sort of like doing the reduction backwards (“Look, any loop can be represented as a set of arcs!”).  But if we start from an Arc Coloring instance, we can create a loop with one statement for each point in the circle, and each arc connects the first and last uses of a variable (so the arc spans the lifetime of the variable).

The idea now is that a coloring of the arcs is a way to color them so that no two arcs of the same color overlap at a point.  This is exactly the same as saying we can assign a register to a set of instructions (matching the 2 uses of a variable) so that no two variables assigned to the same register have overlapping lifetimes.

Difficulty: The actual reduction here is a 4.  (I think it’s a little tricky to understand the idea of a lifetime of a variable, and how it can be represented as just 2 points in the circle)  But it may be more complicated than that to explain what an arc graph is, and does.

Word Problem For Products of Symmetric Groups

The next problem in the Appendix is “Register Sufficiency for Loops”, which basically asks “Given a set of loop variables and K registers, can we assign variables to registers so that there is no conflict between loop iterations?”.  The reduction is better explained in two parts.  This is the first part.

The problem: The “Word Problem for Products of Symmetric Groups” (WPPSG).  This is not in the appendix.  It is defined in the paper by Garey, Johnson, Miller, and Papadimitriou that has the reduction we will want for the Register Sufficiency for Loops problem.

The description: Given an integer K, m sets X1 through Xm (each containing a subset of {1..K}), and a permutation π of the numbers from 1 to K.  Can we get π by a series of permutations defined by permuting with X1, then X2, and so on through Xm?

Example:

To “permute with” an Xi set,  I mean we can rearrange the elements corresponding to the positions in Xi however we want, but elements that are not listed in Xi do not move.

So, suppose we had K=5 and 3 sets:

X1 = {1,3,5}

X2 = {2,3}

X3 = {4,5}

If π = {3,1,2,5,4}, we can do this by first permuting with X1 using {3,1,5}, turning {1,2,3,4,5} into {3,2,1,4,5}.  Then we permute with X2 using {3,2}, turning {3,2,1,4,5} into {3,1,2,4,5}.  Finally, we permute with X3 using {5,4}, turning {3,1,2,4,5} into {3,1,2,5,4}.

If π = {2,1,5,3,4} we can’t do it.  There is no way to get a 2 into the first location, since the only permutation that can possibly place something in the first location is X1, and we have to do the X permutations in order.
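
For small instances like this one, we can just brute-force the definition: starting from the identity arrangement, each X set lets us rearrange the values sitting in those positions however we like, and we ask whether the target arrangement is reachable after using X1 through Xm in order.  This is only a sketch for tiny examples (the data format is mine); the whole point of the reduction below is that the general problem is hard.

from itertools import permutations

def reachable(K, X_sets, target):
    arrangements = {tuple(range(1, K + 1))}          # start from 1..K
    for X in X_sets:
        positions = sorted(X)                        # 1-based positions we may shuffle
        nxt = set()
        for arr in arrangements:
            values = [arr[p - 1] for p in positions]
            for perm in permutations(values):
                new = list(arr)
                for p, val in zip(positions, perm):
                    new[p - 1] = val
                nxt.add(tuple(new))
        arrangements = nxt
    return tuple(target) in arrangements

X = [{1, 3, 5}, {2, 3}, {4, 5}]
print(reachable(5, X, (3, 1, 2, 5, 4)))   # True: the sequence worked out above
print(reachable(5, X, (2, 1, 5, 3, 4)))   # False: nothing can ever move a 2 into slot 1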

Reduction:

If you’re following along, this is Theorem 2 of the paper.  They go from Disjoint Connecting Paths, but add the restriction that the graphs are directed acyclic graphs.  (They say it’s a “trivial modification” to a standard reduction.  We did that reduction slightly differently, but since we went from a flow problem, I think it is fine to get a DAG version of the problem by reducing from a directed version of the flow problem.)

They also modify the problem so that the start vertices all have indegree 0 (and all vertices with indegree 0 are start vertices), and all end vertices have outdegree 0 (and all vertices with outdegree 0 are end vertices).

So, we’re given a DAG G, and a set of start vertices s1..sn along with their matching end vertices t1..tn.  First, we will take the graph and replace all edges a= (u,v) with a pair of edges going through a new vertex just for edge a: (u, wa) and (wa, v).  That new vertex is “private” to the edge and is not used anywhere else.  Our new graph G’ has q vertices.

Since we do have a DAG, we can topologically sort the vertices.  Set up a sorting so that the first vertices in the sort are the vertices s1..sn. We can get away with this because all of the s vertices have indegree 0. Similarly, make the last vertices in the sorting be t1..tn.
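
Here is a sketch of the two preprocessing steps as I read them: subdivide every edge through a private vertex, then topologically sort so that the start vertices come first and the end vertices come last.  Vertex names and the data format are mine.

from graphlib import TopologicalSorter

def subdivide(edges):
    new_edges = []
    for k, (u, v) in enumerate(edges):
        w = f"w{k}"                              # the private vertex for this edge
        new_edges += [(u, w), (w, v)]
    return new_edges

def ordered_vertices(edges, starts, ends):
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    middle = [v for v in TopologicalSorter(preds).static_order()
              if v not in starts and v not in ends]
    return list(starts) + middle + list(ends)    # sources first, sinks last

edges = subdivide([("s1", "a"), ("a", "t1"), ("s2", "b"), ("b", "t2")])
print(ordered_vertices(edges, ["s1", "s2"], ["t1", "t2"]))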

For each vertex vi, define B(i) = {j:(vi, vj) is an edge}.

Now we’re ready to build our WPPSG instance.  Our X sets are:

Xj = {n+j} ∪ B(n+j) for j from 1 to q-n

The last set is Xq-n+1 = {1,2,…q-n}

The permutation we want to build is:

π(i) = q-n+i for i from 1 to n

π(i) = i-n for i from n+1 to q.

Our WPPSG instance has one position in its permutation for each vertex in G’.  Every position starts labeled by its own index.  Each set Xj contains the position of vertex vn+j and the positions of its immediate predecessors.  So permuting with X1 can replace the label at s1 with the label of one of its neighbors, which corresponds to moving along that edge.  Suppose we permute s1’s label onto some vertex v.  Then when we get to the set for v, we will permute with one of v’s neighbors, and so on down the chain until the label ends up on a t vertex.  The final permutation ensures that we have eventually permuted the labels to end up with their corresponding t vertices.  (It doesn’t matter what happens with the rest of the labels.)

Difficulty: 7.  I like the idea of “pushing” a path through a graph into a permutation problem.

Feasible Register Assignment

On to a new section!  This section is on “Program Optimization”, and is the last “real” section- after this, there is a section of “Miscellaneous” problems and a section on “open” problems.  We’re heading towards the end!

PO1 is Register Sufficiency.  We’ll pick up with the next problem.

The problem: Feasible Register Assignment.  This is problem PO2 in the Appendix.

The description: Given a directed acyclic graph G = (V,A), a positive integer K, and a register assignment for each vertex (each vertex maps to one of K registers), can we create a computation for G (in the same sense as we did in Register Sufficiency) that also has the requirement that at each stage of the computation, we never have two vertices that are assigned to the same register in use at the same time?

Thinking about this in terms of the “stones game” from the Register Sufficiency problem, we are saying that each vertex wants to use a specific stone, and so not only can we only use K stones (at most) at a time, we also can’t use the same stone for two different vertices that request that stone at the same time.

In the paper that has the reduction for this problem (and Register Sufficiency), Sethi makes the point that this means that we should keep a value in its register until all of its direct ancestors are computed.  So we are really asking if we can create a sequence of vertices where:

  1. For all vertices u,v, and w in the sequence, if u is a direct descendant of w, and v appears between u and w, then u and v have different register allocations.
  2. For all vertices u and v in the sequence, if u appears before v in the sequence, v is not a descendant of u
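
Here is a checker sketch for those two properties (the data layout and the tiny example are mine): children[v] lists the direct descendants of v, reg[v] is v’s register, and seq is the proposed computation sequence.

def feasible_sequence(children, reg, seq):
    pos = {v: i for i, v in enumerate(seq)}
    for w in seq:
        for u in children.get(w, []):            # u is a direct descendant of w
            if pos[u] > pos[w]:
                return False                     # property 2: descendants come first
            for v in seq[pos[u] + 1:pos[w]]:     # everything strictly between u and w
                if reg[v] == reg[u]:
                    return False                 # property 1: u's register is still busy
    return True

# a tiny made-up instance: t has children x and c, and x and a share register 3
children = {"t": ["x", "c"], "x": [], "c": [], "a": []}
reg = {"x": 3, "c": 2, "a": 3, "t": 1}
print(feasible_sequence(children, reg, ["x", "c", "a", "t"]))   # False: a reuses x's register
print(feasible_sequence(children, reg, ["x", "c", "t", "a"]))   # True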

Example: Here is the example graph from Sethi’s paper that we used before:

Suppose we had the following allocation of registers to vertices (this comes from p. 245 of the paper):

node:     c   x   t1  b   t2  t3  a   t4
register: 2   3   2   1   2   2   3   1

This allocation works if we use the sequence given in the table (c, then x, then t1, and so on).  We can see property 1 satisfied, for example, between t3 and x: x is a direct descendant of t3, and so every vertex between x and t3 in the sequence cannot use x’s allocated register 3.  The second property is also true: we never have an ancestor vertex in the graph appear in the sequence before its descendant.

Here is a register allocation and sequence that will not work:

node:     c   x   b   a   t1  t2  t3  t4
register: 2   3   1   3   2   2   2   1

The problem here is again between t3 and x.  In this sequence, node a also uses register 3 in between, which violates property 1.

In programming terms, what is happening is that the second sequence copies the value of a into register 3, destroying the value of x that was already in there.  But the graph says we needed x to compute t3, and so by the time we got to the part of the sequence where we computed t3, we no longer had x’s value in a register.

Reduction: Sethi goes from 3SAT.  Our 3SAT instance will have n variables and m clauses.

The basic idea is that we will construct a graph with:

  • One leaf vertex for the positive and negative versions of each variable (nodes sk and ~sk for variable k).  These vertices will both be allocated to register “Sk”.
  • For each clause, construct the following subgraph: The lowercase letters are the vertex names.  The capital letters are the register assignments of the vertices (so, the q vertices map to registers “Qi1” through “Qi3”, and so on).  Note that the positive and negative rij vertices both map to the same register “Rij”.
  • Since r vertices represent literals in clauses, we connect them to the s vertices that represent the variables those literals refer to.  In other words, if the literal rij (literal j of clause i) is positive and represents variable xk, we connect an edge from rij to sk and from ~rij to ~sk.  If the literal is negative, we connect from rij to ~sk and from ~rij to sk.

Suppose the formula is satisfiable.  Then if variable xk is true in the satisfying assignment, we add the vertex sk, all of its direct ancestors, and then ~sk to the sequence (in order from k=1 to n).  If xk is false, we add ~sk, then all of its ancestors, then sk.

We know that all of these ancestors are “r” vertices in clauses that contain the variable, and that those “r” vertices only have 1 descendant (the s vertex corresponding to the setting of the variable).  We also know that for all i and j, rij and ~rij have different children (they’ll be different parities of the corresponding s vertices).  This means we have so far only added one vertex out of each rij / ~rij pair.  Since each pair shares a register allocation (and they are the only vertices sharing that register allocation), we’re ok so far.

If we have added the “positive” rij vertex to the sequence, next we will add the parent pij to the sequence.  This will “release” the need for the register allocation for rij, allowing us to add ~rij to the sequence.

After doing that for all rij, we need to handle the situation where we added the “negative” ~rij to the sequence, but not the “positive” version yet.  It turns out that in each clause, since the clause is satisfied, we must have put the positive rij of some true literal in the sequence already.  Let’s pretend that the literal that is true is literal 1.  So we know we have ri1 (and thus pi1) in the sequence, and since we have all of the “negated” r vertices, we must already also have ~ri2 in the sequence.  This lets us add the vertex qi1 to the sequence, which releases the register allocation Ri2 that was held by ~ri2, letting us add ri2 to the sequence, and then we can follow that process around the clause widget.  This will then let us add all Q vertices to the sequence, giving us a legal ordering of the vertices.

 

The other direction is a little more convoluted.  First, he shows that if we have a pair of “parent” vertices (u1 and u2) with the same register allocation, and those parents each have child vertices (v1 and v2) with the same register allocation, then u1 comes before u2 in a legal sequence iff v1 comes before v2 in that sequence.  Since parents have to come after children in the sequence, the only way for that not to happen is if we had the sequence v2, v1, u1, u2.  But that sequence forces us to visit both v1 and v2 while they are both holding the same register, which can’t happen.

Next, we look at a sequence for the graph we build from a satisfiability instance.  He shows that if we have a legal sequence, there must be, for all i, some rij that appears before the corresponding ~rij.  If that wasn’t the case, we’d have no legal way to get to the Q vertices, since the ~rij would hold its register allocation the whole time, meaning we could never put the positive rij in the sequence.

The vertex whose rij appears first corresponds to a literal being set to true in its clause, and this gives us a way to satisfy the entire formula.

Difficulty: 8.  This is a pretty tricky construction, even if it is similar to other SAT reductions we’ve seen.  But the need to add the extra p vertex on the positive (but not negative) side, and the Q vertex having as descendants rij and ri(j+1) make this hard to follow.
