# Category Archives: Appendix: Storage and Retrieval

## Hitting String

This is a good easy problem hidden in the middle of all of the hard ones.

The problem: Hitting String.  This is problem SR12 in the appendix.

The description: Given a set of strings A over the alphabet {0,1,*} all of the same length n, can we find a string x over the alphabet {0,1}, also of length n, where x agrees in at least one position with each string in A?

Example: Let A = {00000,11111,0*001, ***10, 101**, 1****}

Then x can be 10100.  It agrees with 00000 in the last position, 11111 in the first position, 0*001 in the fourth position, ***10 in the last position, 101** in the first position, and 1**** in the first position.

A pretty easy example of an instance that can’t be solved is A = {1**, 0**}, or even A = {***}

Reduction: We’ll go from 3SAT.  Each clause in the 3SAT instance will become a string in A.  Each string in A will have one position for each variable in the formula.  Each string will have a 1 in each variable’s position if that variable occurs positively in that clause, a 0 if it occurs negatively, and a * for all variables that don’t appear in the clause.  So each string will have just 3 characters that are not *.

Thus, we need to come up with a string x that has a 1 or 0 in each position (corresponding to a setting of a variable) that matches one of the three 1 or 0 characters in each string (satisfying that clause).
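Since the construction is so mechanical, it's easy to sketch in code.  Here's a quick Python sketch (the clause encoding as signed integers is my own assumption, not from the problem statement):

```python
# Sketch of the 3SAT -> Hitting String reduction described above.
# Encoding (my assumption): a clause is a list of nonzero ints,
# +i for variable i appearing positively, -i for negated; variables are 1..n.

def clause_to_string(clause, n):
    """One string in A per clause: 1/0 where a variable appears, * elsewhere."""
    s = ['*'] * n
    for lit in clause:
        s[abs(lit) - 1] = '1' if lit > 0 else '0'
    return ''.join(s)

def hits(x, a):
    """x agrees with a iff they share a character in some position.
    Since x has no *'s, a match can only happen at a non-* position of a."""
    return any(xc == ac for xc, ac in zip(x, a))

def satisfies(x, clause):
    """x, read as a truth assignment, satisfies the clause."""
    return any((lit > 0) == (x[abs(lit) - 1] == '1') for lit in clause)

# (x1 v x2 v x3) ^ (~x1 v ~x2 v ~x3); x = 100 sets x1 true, x2 and x3 false.
formula = [[1, 2, 3], [-1, -2, -3]]
A = [clause_to_string(c, 3) for c in formula]
print(A)                                      # ['111', '000']
x = '100'
print(all(hits(x, a) for a in A))             # True
print(all(satisfies(x, c) for c in formula))  # True
```

The point of the sketch is that "x hits the clause's string" and "x, read as an assignment, satisfies the clause" are literally the same condition.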

Difficulty: 4, maybe 3.  This is about as straightforward a “Turn SAT into a different kind of problem” reduction as you’re likely to see, but I do think problems that cross genres are harder for students than we may anticipate them to be.

## Bounded Post Correspondence Problem

I’m sure most people remember the “Post Correspondence Problem” from Theory of Computation classes.  It’s mainly famous for two reasons:

• It’s the classic example of an undecidable problem that isn’t about a language (like the Halting Problem is)
• It leads to lots of fun (meaning “bad”) PCP jokes.  (“I need some PCP to be able to handle thinking about the PCP”)

Anyway, here is a restricted version of the problem that at least is decidable, if intractable.

The problem: Bounded Post Correspondence Problem.  This is problem SR11 in the appendix.

The description: Given two sequences of strings A = (a1, …, an) and B = (b1, …, bn) over a finite alphabet, and a positive integer K, can I find a sequence of integers i1..ik, where k ≤ K, such that the string A’ = ai1ai2…aik is identical to the string B’ = bi1bi2…bik?

Example: The difference between this and “classic” PCP is the introduction of K, which makes the problem decidable because, if nothing else, you can just try all index sequences of length K or less, which is a finite set.

So, any PCP instance that “works” is a good example to use here.  Here’s one that I got from my old Hopcroft and Ullman theory of computation book:

A =  {1, 10111,10}  B = {111,10,0}

We can construct the string 101111110.  From A, it’s built from 10111/1/1/10.  From B, it’s built from 10/111/111/0

Notice that we need to use the same sequence of indices on both sides (the second string, then the first one twice, then the third one).
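The brute-force decision procedure mentioned above is easy to sketch (exponential, of course, but that's the point: with the bound K, the search is at least finite).  Indices here are 0-based:

```python
# Brute-force Bounded PCP: try every index sequence of length <= K.
from itertools import product

def bounded_pcp(A, B, K):
    """Return a witness index sequence (0-based), or None if there isn't one."""
    n = len(A)
    for k in range(1, K + 1):
        for seq in product(range(n), repeat=k):
            if ''.join(A[i] for i in seq) == ''.join(B[i] for i in seq):
                return list(seq)
    return None

# The Hopcroft & Ullman instance from above:
A = ['1', '10111', '10']
B = ['111', '10', '0']
print(bounded_pcp(A, B, 4))   # [1, 0, 0, 2]
```

The witness [1, 0, 0, 2] is exactly the sequence in the example: A gives 10111/1/1/10 and B gives 10/111/111/0, both spelling 101111110.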

Reduction: The reference in G&J is to a technical report by Constable, Hunt, and Sahni and says they use a “generic transformation”.   In the paper, that means they show it’s NP-Complete by showing it’s a “negation of freedom for a 2 variable loop free scheme”  Which is, as near as I can tell, a way of defining programs to be equivalent.

I’ll admit that I don’t really understand it.  I really want there to be a simpler reduction out there.  I found this post on stackexchange,  but it reduces from Shortest Common Supersequence, and uses a |R| of 2, and G&J say that Shortest Common Supersequence is polynomial with |R|=2, so something has to be wrong there.

I was tinkering with a reduction from Directed Hamiltonian Cycle, where the strings in A corresponded to edges, and the strings in B corresponded to vertices.  So, for a graph with vertices a, b, c, and d and directed edges a→b, b→c, c→a, and c→d (the original figure is lost; this is the graph the strings imply):

You would get strings like this:

A = {x, ab, bc, ca, cd}    B = {xa, bb, cc, a, dd}

The idea is that each string in A corresponds to a directed edge, and each string in B corresponds to 2 copies of the destination of that edge.  The “x” is there to give us a starting point- we arbitrarily choose some vertex to be the start.  All edges that have this starting vertex as their destination get only 1 copy (instead of 2) of that vertex in B.

The problem is that a string like xabbcca  (corresponding to a non-Hamiltonian cycle) is a solution to the PCP problem, but is not a solution to the HC problem.  I don’t know how to force the string to be long enough to visit all vertices- I thought about working on a variant of the problem where you force the size of the string you create to be exactly =K instead of ≤ K.  But  since a cycle that doesn’t involve the start vertex can be “pumped” several times to make a string longer, that won’t work either.  There is probably something clever you can do with the strings to force you to visit all vertices, but I can’t see it.

Difficulty: 10, sadly.  I’m sure there is something easier out there.

## Shortest Common Superstring

I’ve been doing this for long enough now that I start seeing the same paper come up multiple times in multiple situations.  When I looked up the reference for this problem, I remembered that I first got this paper as a reference for the Vertex Cover on Cubic Graphs problem, and now we’re doing the problem that they actually needed that problem for.  It was kind of cool to go into my file cabinet and say “Oh yeah, I remember this paper”

(In my VC3 post, I noted that this paper was currently my “Most Obscure Reference”.  That title has been supplanted by the Rosenthal PhD thesis I needed for the Network Survivability problem, which they sent me a hard copy of and wouldn’t let me duplicate or remove from the library.  That might never be surpassed, though there is a problem coming up (Sparse Matrix Compression) that I still haven’t found a reference for.)

The problem: Shortest Common Superstring.  This is problem SR9 in the appendix.

The description: Given a finite set R of strings from some finite alphabet, and a positive integer K, can we find a string w of length K or less such that each string in R is a substring of w?

Example: Notice how similar this is to last week’s Shortest Common Supersequence problem.  The difference is that supersequences can have gaps in the matching, but superstrings can’t.  So if we use the strings from last week: {aab, aaab, aaaba, aaaa}, our w needs to have each of these as a substring (in consecutive positions in w).  The string “aaaaba” is the shortest I could find.  If we add “bb” to R, I think the only way to do that is to add the whole string bb either at the start or the end, since you need both bb and aaba consecutively in w.
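For an R this small we can verify the claim by brute force.  Here's a sketch of my own (not from the paper): it drops strings contained in other strings, then tries every ordering of what's left, merging consecutive strings with the largest possible overlap.

```python
# Exact brute-force shortest common superstring, usable only for tiny R.
from itertools import permutations

def shortest_superstring(R):
    # Strings contained in another string are free; drop them first.
    R = [s for s in R if not any(s != t and s in t for t in R)]
    best = None
    for perm in permutations(R):
        w = perm[0]
        for s in perm[1:]:
            # Find the largest k where the end of w matches the start of s.
            k = min(len(w), len(s))
            while k > 0 and w[-k:] != s[:k]:
                k -= 1
            w += s[k:]
        if best is None or len(w) < len(best):
            best = w
    return best

R = ['aab', 'aaab', 'aaaba', 'aaaa']
w = shortest_superstring(R)
print(w, len(w), all(s in w for s in R))   # aaaaba 6 True
```

So "aaaaba", length 6, really is the shortest superstring for this R.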

Reduction: The paper by Maier and Storer has the reduction.  This is why they needed to prove Vertex Cover on cubic graphs was NP-Complete in the first place.  They show that this problem remains NP-Complete even in the case where all strings in R are the same length H, as long as H is at least 8.  For our purposes, we can get away with setting H=8, and moving on.

So, suppose we’re given a graph G=(V,E), where all vertices in G have degree 3, and an integer K.  For each vertex, we’ll label its 3 edges with a number between 0 and 2.  For vertex a and edge e, call this number g(a,e).  They don’t enforce a rule that if an edge e=(a,b), then g(a,e) = g(b,e).

Each vertex a will generate 5 characters in the alphabet: a, a’, a0, a1, and a2. There is also a character \$ not used anywhere else.  Writing \$^k for k consecutive copies of the \$ character, each vertex generates 7 strings:  (Recall that H=8 for us, but I’m leaving it as H to make it consistent with the paper.)

1. \$^(H-2) a a0
2. a0 \$^(H-2) a1
3. a1 a’ \$^(H-4) a a1
4. a1 \$^(H-2) a2
5. a2 a’ \$^(H-4) a a2
6. a2 \$^(H-2) a0
7. a0 a’ \$^(H-2)

Note that each of those strings is length H.

We also define two components for each edge e=(a,b):

• Ea = a a_(g(a,e)) a_((g(a,e)+1) mod 3) a’
• Eb = b b_(g(b,e)) b_((g(b,e)+1) mod 3) b’

..both of size 4.  From these we build two strings to add to R:

Eab = Ea \$^(H-8) Eb and

Eba = Eb \$^(H-8) Ea

..both of length H.  R has 7|V|+2|E| strings in it.  Set K’ = |R| + K – 11|E|.

Notice that the 7 vertex strings all overlap (the end character of one is the starting character of another).  The two strings for each edge also overlap.  So a “standard” superstring that contains all strings as a substring with these overlapping characters is of length |R| – 6|V| – 4 |E|.  Since |V| = 2/3 |E| (because G is cubic), that is length |R| – 8 |E|.

If we are given a VC of G, then suppose some vertex v is in the cover.  The edge (v,w) is connected with v.  Then either string 1,3, or 5 from the vertex list overlaps with the beginning of the string corresponding with (v,w), and either string 3,5, or 7 can overlap with the end of the edge string corresponding with (v,w).  The paper shows how, by making a series of these improvements when we have a vertex cover, we can take the standard superstring and bring the length down to K’.

In the other direction, if we have a minimal length superstring, we need to use it to build a cover of G.  The paper does this by showing that to be of this length, there need to be overlaps of the kind described above, and from those overlaps, we get our cover.

Difficulty: 8.  I wish there was an easy reduction to this from Shortest Common Supersequence, but I can’t find it.  I also wish that I could think of a better notation than g(a,e) because having them as subscripts and having that “mean” a character like a0-a2 is hard to follow.

## Shortest Common Supersequence

Now to go back to the problems I skipped over.  This one uses a more complicated version of the “template string” used in the Longest Common Subsequence reduction.

I also think this reduction has a small mistake in the paper, though it’s possible I’m misunderstanding something.  Anyone who’s interested is encouraged to check the paper, and my work to see what’s going on.

The problem: Shortest Common Supersequence.  This is problem SR8 in the appendix.

The description: Given a finite set of strings R over some finite alphabet Σ, and a positive integer K, can I find a string w over the alphabet Σ of size K or less such that each string in R is a subsequence of w?

Example: Here’s the R from last time: {aab, aaab, aaaba, aaaa}

If we choose w to be aaaba, each string in R is a subsequence of w.  (Remember subsequence is different from substring in that subsequences are allowed to have gaps.  So for example aaaa is a subsequence of aaaba.).  If we add the string “bb” to R, then  w needs to have an extra b someplace (some candidates for w are aaabba, aaabab, baaaba, and so on).

Reduction: The same paper by Maier as last week has this reduction too (This is the “SCS” problem in the paper).  The reduction is again from VC, so we’re given a graph G=(V,E) and an integer K.   The alphabet ∑ will have one symbol for each vertex in V, one symbol for each edge in E, and a special * symbol.  Define c = max(|V|,|E|).  Also note that the vertices and edges are “ordered” somehow (it doesn’t matter the exact ordering) so that we can have an order of the vertex and edge characters in ∑.

Just as in the LCS problem, the strings in R will be based on a template string.  The template string T consists of (in order):

• All of the alphabet symbols corresponding to the vertices in V, in order
• A section of 4c stars
• All of the alphabet symbols corresponding to the edges in E, in order, twice in a row (so 2 copies of the first edge symbol, then 2 copies of the second edge symbol, and so on)
• All of the alphabet symbols corresponding to the edges in E, in order, twice in a row (again)
• Another section of 4c stars
• All of the alphabet symbols corresponding to the vertices in V, in order (again)

This template string is length 8c+4|E|+2|V|

T will be in R, and we will have one string in R for each edge in E.  For an edge e=(vi, vj), we create a string with (in order):

1. Two copies of the alphabet symbol for e
2. The alphabet symbol for vi
3. A section of 4c stars
4. The alphabet symbol for vj
5. Two copies of the alphabet symbol for e (again)

Set K’ = 8c+6|E| + 2|V| + K
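The construction up to this point is mechanical enough to sketch in code.  The symbol names are my own, and I represent "strings" as Python lists because the characters of Σ take more than one character to write:

```python
# Sketch of Maier's VC -> Shortest Common Supersequence construction.
# Vertex i becomes symbol 'v{i}', edge (i,j) becomes symbol 'e{i}{j}'.

def build_scs_instance(V, E, K):
    c = max(len(V), len(E))
    vsyms = [f'v{v}' for v in V]
    esyms = [f'e{a}{b}' for a, b in E]
    stars = ['*'] * (4 * c)
    doubled = [s for e in esyms for s in (e, e)]   # each edge symbol twice
    # Template: vertices, 4c stars, edges doubled (twice), 4c stars, vertices.
    T = vsyms + stars + doubled + doubled + stars + vsyms
    R = [T]
    for (a, b), e in zip(E, esyms):
        R.append([e, e, f'v{a}'] + stars + [f'v{b}', e, e])
    Kprime = 8 * c + 6 * len(E) + 2 * len(V) + K
    return R, Kprime

V = [1, 2, 3]
E = [(1, 2), (2, 3)]
R, Kp = build_scs_instance(V, E, 1)
c = max(len(V), len(E))
print(len(R[0]) == 8 * c + 4 * len(E) + 2 * len(V))   # True
print(all(len(s) == 4 * c + 6 for s in R[1:]))        # True
```

The two checks at the end just confirm the lengths claimed in the text: the template has length 8c+4|E|+2|V|, and each edge string has length 4c+6.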

If G has a VC V’ of size K, then each edge has at least one vertex in V’.  Let W be the set of edges whose “first” vertex occurs in the cover (by “first”, we mean the vertex that comes alphabetically first in the ordering we used to make ∑), and let U be the set of all remaining edges (and thus for all edges in U, the “second” vertex of the edge is in V’).

Take our template string T and add:

• The alphabet symbol for each edge in W, twice, at the beginning of T
• The alphabet symbols for each vertex in V’ in between the two copies of the edge lists in the middle (between the third and fourth bullets in the above definition of T)
• The alphabet symbol for each edge in U, twice, at the end of T.

This adds 2|E|+K to T’s size  (each edge is either in W or U, and will appear twice), and so this new T’ is the right size for a SCS solution.

Now we need to show that each string in R is a subsequence of this new T’.  T is obviously a subsequence of T’.  For each string in R that is based on an edge e, the mapping depends on whether e is in W or U.  It may help to break down what T’ looks like:

1. 2 copies of W
2. 1 copy of V
3. 4c stars
4. 1 copy of E
5. 1 copy of V’
6. 1 copy of E
7. 4c stars
8. 1 copy of V
9. 2 copies of U

I personally think this is a mistake in the paper.  I think U needs to go at the beginning and W needs to go at the end, see below.

If e is in W, what we do is (referring to the five numbered parts of the edge string as sections a through e):

• Map section a of the string (the first pair of e symbols) to section 4 of T’
• Map section b (the symbol for vi) to section 5 of T’  (We know it’s in those vertices, because e is in W, so its first vertex is in the cover)
• Map section c (the stars) to section 7 of T’
• Map section d (the symbol for vj) to section 8 of T’
• Map section e (the second pair of e symbols) to section 9 of T’

That last part is why I think there is a mistake either in the paper, or in the way I understand it.  You can only map the edge to the last part of T’ if the edge is in that part.  So that part needs to be W, not U.  I can’t find a way to make a mapping if W is first.

If e is in U, what we do is:

• Map section a of the string to section 1 of T’ (again, assuming the paper has the U’s and W’s backwards)
• Map section b (the symbol for vi) to section 2 of T’
• Map section c (the stars) to section 3 of T’
• Map section d (the symbol for vj) to section 5 of T’ (the edge being in U means its second vertex is in the cover)
• Map section e (the second pair of e symbols) to section 6 of T’

So, T’ is a supersequence of all strings in R.

In the other direction, suppose the strings in R have a supersequence of length K’.  Call that supersequence T” (since it’s not necessarily T’).  To be a supersequence of T,  T” needs to contain two sets of 4c stars someplace, and they will be consecutive (otherwise T” is too long).   Each other string in R will have its stars map to either the left set of 4c stars or the right set (they won’t split); otherwise T” is too long.

Suppose for some string in R, the stars go to the part of T” that maps to the left set of stars in T.  Since there are no edge characters in T before the left set of stars, characters need to be added to T” to handle section a (the leading edge symbols) of the edge string.  (If the edge string’s stars map to the right end, we have a similar problem at the end of the string.)  We will also need a place to put the second vertex of the edge (section d of the string) before the copy of the edges in E.

The only way to add these extra symbols to T (forming T”) and not exceed K’ characters in length is to build something like T’ which starts with T, adds a vertex cover in the middle, and adds edges whose first vertex is not in the cover on one side of T, and edges whose second vertex is not in the cover on the other side.

Difficulty: 7.  It’s very hard to see why you need 2 copies of each edge, or why 4c is the number of stars you need.  But the idea of a construction where vertices are on the outside and edges are on the inside (in T) that needs to match up with strings with edges on the outside and vertices on the inside (in the rest of R) is something that I think students can follow, even if I doubt they could come up with it themselves.

## Longest Common Subsequence

I’m going out of order here, since the next few problems are all related, and this is the easiest one, and is a good way to get thinking about this kind of problem.

The problem: Longest Common Subsequence.  This is problem SR10 in the appendix.

The description: Given a finite set of strings R over some finite alphabet Σ, and a positive integer K, can I find a string w over the alphabet Σ such that w has length K or more, and w is a subsequence of each string in R?

Example: Suppose R was the strings {aab, aaab, aaaba, aaaa}

Then the longest w we could choose that is a subsequence of everything in R would be “aa”.  Notice that if we add “ba” to R, then the longest w we can make is length 1 (the string “a”, since “aaaa” contains no b).

Reduction: The paper by Maier uses VC.  (This is the “LCS” problem in the paper).  So we start with a graph G = (V,E)  and a K.  Our alphabet Σ has one character for each vertex in V.  R will consist of |E|+1 strings, based on a “template string” T consisting of the characters for all of the vertices in V, in some order. The first string in R is T itself.  Then, each edge (u,v) gets a string: Two copies of T concatenated.  The first copy has u removed, the second one has v removed.  Set K’ to |V|-K.
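This construction is simple enough to sketch directly, using one character per vertex:

```python
# Sketch of the VC -> Longest Common Subsequence construction.
# T is the template string (one character per vertex); E is a list of edges.

def build_lcs_instance(T, E, K):
    R = [T]
    for u, v in E:
        # Two copies of T: the first with u removed, the second with v removed.
        R.append(T.replace(u, '') + T.replace(v, ''))
    return R, len(T) - K

# The worked example from below: T = 12345 with the single edge (2,4).
R, Kprime = build_lcs_instance('12345', [('2', '4')], 2)
print(R)        # ['12345', '13451235']
print(Kprime)   # 3
```

The second string in R matches the 13451235 worked out in the example a few paragraphs down.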

Suppose G has a VC, call it V’.  Then take T and remove all of the vertices in V’ from it.  This remaining string will be called T’ and is of size |V|-K.  We will show that T’ is a subsequence of each string in R.  It’s pretty clearly a subsequence of T.

Each other string in R is built from some edge (u,v) in E.  So either u or v is in V’.  If u is in V’, then u will be missing from T’, and u will be missing from the first copy of T in the string in R, so T’ is a subsequence of that first copy.  Similarly, if v is in V’, then T’ is a subsequence of the second copy of T.

If we find a string T’ of size K’, then first notice that for each string in R made from an edge (u,v), T’ can’t have both u and v.  The reason is that if (say) u comes before v in T, then in the string in R, u will be missing in the first half of that string, and v will be missing in the second half.  So the occurrence of v will come first in the string and the occurrence of u will come second, so they will be “swapped”.

It’s much easier to see with an example.  Suppose we have 5 vertices, and T = 12345.  If we have the edge (2,4), then the string in R is 13451235.  T’ can’t contain 24 in that order, because the 4 comes before the 2 in the string.

So this means that if we let V’ be the vertices in V whose characters are not in T’, we have K vertices, and since T’ can’t contain both endpoints of any edge, at least one vertex from V’ is an endpoint of each edge in E.  This forms a vertex cover of G.

Difficulty: 5.  I may have downgraded this difficulty after seeing the much harder SR8 and SR9, but I do think that it is pretty straightforward.

## Capacity Assignment

This problem is from the same “unpublished manuscript” as last week’s.

The problem: Capacity Assignment.  This is problem SR7 in the appendix.

The description: Given a set C of “communication links”, and a set M of positive capacities.  Each pair of a link c and a capacity m has a cost g(c,m) and a delay penalty d(c,m) with the following properties:

• If i < j ∈ M, then g(c,i) ≤ g(c,j)
• If i < j ∈ M, then d(c,i) ≥ d(c,j)

We’re also given positive integers K and J.  The problem is: Can we assign a capacity to each link such that the total g cost of all of our assignments is ≤ K and the total d cost of all of our assignments is ≤ J?

Example: There’s a lot to parse in that problem description.  The first thing to notice is that the set of links C doesn’t necessarily have to link anything together (it’s not like it has to apply to an underlying graph).  So we can just give them names:

C={a,b,c,d,e}

Next, there is no reason why the set of capacities has to be assigned as a bijection to C- the set M could be a different size entirely than the size of C:

M={1,2}

The cost function has to have the property that if we assign a 2 to a link, it has to cost as least as much as assigning 1 to the link:

g(c,1) = 3 for all c

g(c,2) = 4 for all c

The delay function has to have the property that if we assign a higher capacity to a link, the delay can’t be larger than assigning a lower capacity:

d(c,1) = 6 for all c

d(c,2) = 5 for all c

In this case, if we assign the capacity of 1 to all links, we get a total cost of 15 and a total delay of 30.  If we assign the capacity of 2 to all links, we get a total cost of 20 and a total delay of 25.     If we have K = 18, and J = 27, we can achieve that by setting 2 links to have capacity 1 and 3 links to have capacity 2.

The reduction: The example above is pretty close to how the reduction will work.  We will reduce from Sum of Subsets, so we start with a set S of integers and a target B.   Our set C will have one element for each element in S.  Our set M will be {1,2}.  Assigning a capacity of 1 will imply we don’t want to take this element in S’, and assigning a capacity of 2 will imply that we do.  (This makes more sense if I can use the set {0,1} for M, but the problem description says the elements of M have to be positive)

We will define our g function so that g(c,1) = 1 for all c, and g(c,2) will be s(c)+1 (where s(c) is the size of the element in S that corresponds to c).

Our d function will work similarly:  d(c,1) = s(c)+1 for all c, and d(c,2) = 1 for all c.  These functions both follow the restrictions for how g and d work.

Set K = |S| + B.  Each link contributes a cost of 1, plus s(c) more if it is assigned capacity 2.  So the cost constraint says that the sizes of the elements assigned capacity 2 (the ones we “take”) can sum to at most B.

Let T = the sum of all of the sizes of all of the elements in S.  Then let J = |S| + T – B.  Each link contributes a delay of 1, plus s(c) more if it is assigned capacity 1.  So the delay constraint says that the sizes of the elements assigned capacity 1 (the ones we don’t take) can sum to at most T – B.
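Here's the whole reduction sketched in code (my own encoding: links are just indices into S, so c ranges over 0..|S|-1):

```python
# Sketch of the Sum of Subsets -> Capacity Assignment reduction.

def build_instance(S):
    """S: list of positive element sizes.  Returns the g and d functions
    as dicts keyed by (link, capacity)."""
    g = {(c, 1): 1 for c in range(len(S))}
    g.update({(c, 2): S[c] + 1 for c in range(len(S))})
    d = {(c, 2): 1 for c in range(len(S))}
    d.update({(c, 1): S[c] + 1 for c in range(len(S))})
    return g, d

def totals(assign, g, d):
    """Total cost and total delay for a capacity assignment {link: capacity}."""
    return (sum(g[c, m] for c, m in assign.items()),
            sum(d[c, m] for c, m in assign.items()))

S, B = [1, 2, 3], 3           # a made-up SOS instance; S' = {3} works
g, d = build_instance(S)
K = len(S) + B                # 6
J = len(S) + sum(S) - B       # 6
# Capacity 2 means "take the element"; take just the 3.
cost, delay = totals({0: 1, 1: 1, 2: 2}, g, d)
print(cost, delay)            # 6 6
print(cost <= K and delay <= J)   # True
```

Taking exactly a subset that sums to B hits both bounds with no slack, which is the point of the K+J argument below.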

If S has a SOS solution S’, then assigning a capacity of 2 to all elements in S’ and a 1 to all elements not in S’ gives us a cost value of exactly K, and a delay value of exactly J.

If we have a Capacity Assignment solution, then notice that K+J = 2|S| + T, which is the sum of all costs and delays no matter what assignment is chosen.  (g(c,m) + d(c,m) = s(c)+2, for all c, no matter what m we use).  So if the sum of the costs (or delays) were strictly less than K (or J), the sum of the delays (or costs) would have to be strictly more than J (or K).  The only way to satisfy both the K and J constraints is to make the sums exactly equal, which gives us a SOS solution.

Difficulty: 4.  I think the algebra for this problem is a little easier than last week’s, but it does take some work to understand what the problem is asking.  Changing the problem slightly to allow assignments and costs and delays to be 0 instead of making them all be positive integers makes the reduction easier too.

## Multiple Copy File Allocation

These next two problems are from an “unpublished manuscript” that I can’t find.  Assuming I did them correctly, I think they are also good problems for students to work on.

The problem: Multiple Copy File Allocation.  This is problem SR6 in the appendix.

The description: Given a graph G=(V,E), where each vertex v ∈ V has a “usage” u(v) and a “storage cost” s(v) (both positive integers), and a positive integer K.  Defining d(v) to be the length of the shortest path from v to the closest member of a subset V’, can we find a subset V’ of V such that the sum of:

• The s(v) values for all vertices in V’ plus
• The d(v)*u(v) values for all vertices not in V’ is ≤ K?

Example: The idea is that each vertex in V’ pays its s(v) cost, and each vertex not in V’ pays its u(v) cost for each edge it has to go through to get to something in V’.  Here’s a simple example (the original figure is lost, but it showed a center node labeled “1,10” with three neighbors whose u values are 2, 3, and 4, plus a node labeled “90,1” two edges away from the center):

Each node lists its s value first and u value second.  If we take just the “1,10” node into V’, our total cost is:

• 1 from the “1,10” node (its s value)
• 2 + 3 + 4 from each of its neighbors (their u values * a distance of 1)
• 2 from the “90,1” vertex.  (u cost of 1, times a distance of 2 edges from the vertex in V’)

..for a total of 12.  I think that’s minimal.

Reduction: G&J say to use  Vertex Cover, and mention that we can make all s values equal to each other and all u values equal to each other.  The phrasing implies that we shouldn’t make the s value and u values the same altogether, which is a good hint.

So we start with our VC instance: a graph G=(V,E) and an integer K.  We will use the same graph.  Let s = the s(v) for each vertex = |V|.  Let u = the u(v) for each vertex = the smallest integer larger than |V|/2.

Set K’ = K*s + (|V|-K)* u.  Hopefully it’s clear how this number lets a VC solution become a MCFA solution: put the K cover vertices into V’, and every other vertex is adjacent to something in the cover, so it pays exactly u.
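A quick sketch of how the total cost gets evaluated (my own code, using BFS for the d(v) distances; the tiny path graph is a made-up example, not the one from the figure):

```python
# Evaluate the Multiple Copy File Allocation cost of a chosen subset V'.
from collections import deque

def mcfa_cost(adj, s, u, chosen):
    """s(v) for each v in chosen, plus d(v)*u(v) for every other vertex,
    where d(v) is the BFS distance to the nearest chosen vertex."""
    dist = {v: 0 for v in chosen}
    q = deque(chosen)
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return sum(s[v] if v in chosen else dist[v] * u[v] for v in adj)

# Made-up path graph a - b - c with uniform s = 5 and u = 2:
adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
s = {v: 5 for v in adj}
u = {v: 2 for v in adj}
print(mcfa_cost(adj, s, u, {'b'}))   # 9  (5 for b, plus 2*1 each for a and c)
```

Since {b} is a vertex cover of this path, 9 is exactly the K’ = 1*5 + 2*2 you'd get from the formula above.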

If we have a MCFA solution, first notice that any vertices that are 2 (or more) edges away from something in V’ are contributing 2*u to the total cost.  Since 2*u > s, we can add a vertex to V’ at a distance 1 away from these distant vertices and keep our total cost ≤ K’.  This means we can transform any MCFA solution into one that is also a solution but where V’ is a cover of V.

Also notice that even after making these changes that if we were to have K+1 (or more) vertices in V’, our total cost would be (K+1)*s + (|V|-K-1)* u.  This is more than K’ (it adds an extra s of cost |V| and removes a u of cost |V|/2+1).

So, we can construct a MCFA solution where there are K or less vertices in V’ that form a cover.  This gives us a solution to the VC instance.

Difficulty: 5.  It takes a little playing around with the algebra to get the correct values, but the idea that we can make the s and u values the same for all vertices is helpful, and gives a clear path to how you’d reduce this problem from VC.

## Rooted Tree Storage Assignment

This whole section is filled with problem descriptions that confuse me.  Here’s another one:

The problem: Rooted Tree Storage Assignment.  This is problem SR5 in the appendix.

G&J’s definition: Given a set X, and a collection C= {X1..Xn} of n subsets of X, and an integer K.  Can we find a collection C’ = {X1‘..Xn‘} of n subsets of X, where:

• Each Xi‘ contains all of the elements of its corresponding Xi, plus possibly some extra elements.
• There are at most K new elements added across all Xi
• We can treat the elements of X as vertices and add directed edges to form a directed tree where the elements of each Xi form a directed path.

Example: Suppose X = {1,2,3,4,5}, and X1 = {1,2,3,4}, X2 = {1,3,4}, X3 = {1,3,5}, X4 = {2,5}

Then we can set each Xi‘ to be = Xi, except X3‘ where we’ll add the element 2.  This gives us an arrangement.  (The original figure is lost, but one tree that works is the directed path 5→2→1→3→4, along which each Xi‘ forms a consecutive run of vertices.)

The way to think of the Xi is as “required elements along a path from the root”, and the Xi‘ as “the actual elements along the path from the root”.
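The original figure is gone, but from the sets themselves one arrangement that works is the chain 5→2→1→3→4 (my reconstruction, so treat it as an assumption).  When the tree is just a chain, "forms a directed path" simply means the subset's positions in the chain are consecutive, which is easy to check:

```python
# Check the "forms a directed path" condition when the tree is a single chain.

def forms_directed_path(chain, subset):
    """In a chain, a subset forms a directed path iff its positions
    in the chain are consecutive."""
    pos = sorted(chain.index(x) for x in subset)
    return pos == list(range(pos[0], pos[-1] + 1))

chain = [5, 2, 1, 3, 4]   # assumed arrangement for the example above
print(forms_directed_path(chain, {1, 2, 3, 4}))   # True  (X1')
print(forms_directed_path(chain, {1, 3, 4}))      # True  (X2')
print(forms_directed_path(chain, {1, 3, 5}))      # False (X3 before adding 2)
print(forms_directed_path(chain, {1, 2, 3, 5}))   # True  (X3' after adding 2)
print(forms_directed_path(chain, {2, 5}))         # True  (X4')
```

The False result for {1,3,5} shows exactly why the element 2 has to be added to X3.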

But the paper from Gavril that has the reduction gives what I think is the better definition:

Gavril’s Definition: (He calls this problem “Augmented directed tree arrangement”) Given a bipartite graph B, whose vertices can be split into two sets X and Y, and an integer K, can we add K or less edges to B to form a new bipartite graph such that:

• The elements in X can be arranged as a directed tree
• For each subset (I think- it’s not entirely clear whether this is a subset or just a single vertex) S of Y, the vertices in X that are adjacent to S can be arranged to form a directed path in the directed tree of X.

I like this better because you can see that the vertices that need to be in order in the tree  are “pointed to” by edges from Y.  The extra edges that you get in Xi‘ are created by adding more edges between X and Y.

Example: The example above would be represented by the following bipartite graph:

Notice how the vertices from X are arranged to almost form a directed path with no gaps, but we need to add an edge from X2 to Y3 to fix the arrangement.

Reduction: The Gavril paper uses Rooted Tree Arrangement.  So we’re given a graph G=(V,E) and an integer K.  Our bipartite graph will have X=V, and Y=E, and each vertex in Y will connect to the 2 vertices in X that correspond to the endpoints of the edge in G. K’ = K-|E|.

If you’re worried- like I was- that this might possibly make K’ negative, keep in mind that the K for the Rooted Tree Arrangement Problem was the sum of the paths between pairs of vertices in G that are connected by an edge.  So for K to be meaningful it has to be ≥ |E|.

If we have a solution for the Rooted Tree Arrangement problem, then we have a path between all pairs of vertices connected by an edge in E.  So, for each vertex in Y, look at the path in the solution. If the path goes through additional vertices besides the endpoints of the edge, add an edge to our bipartite graph from the vertex in Y to the vertex passed through in X.  In doing so, we will add at most K-|E| edges to the bipartite graph, and can use the arrangement that is a solution to the Storage Assignment problem.

If we have a solution to the Storage Assignment problem, the edges that were added to the bipartite graph mark the vertices traveled along a path that solves the Rooted Tree Arrangement problem, and so tells us how to build a satisfying arrangement.

Difficulty: 6.  This isn’t the hardest reduction to understand, but it does require you to know the Rooted Tree Arrangement problem really well, and to understand just what the job of the bipartite graph is.

## Expected Retrieval Cost

Here’s another problem where the G&J definition confused me for a bit.

The problem: Expected Retrieval Cost.  This is problem SR4 in the appendix.

The description: Given a set R of records, each with a probability of being accessed between 0-1 (and the sum of all probabilities = 1), some number m of sectors to place records on, and a positive integer K.  Can we partition R into m disjoint subsets R1..Rm  such that:

• The “latency” cost of 2 sectors i and j, called d(i,j) is j-i-1  (if i < j) or m-i+j-1 (if i >=j)
• The probability of a sector, called  p(Ri), is the sum of the probabilities of the records on that sector
• The sum, over all pairs of sectors i and j, of p(Ri) * p(Rj) * d(i,j) is K or less

Example: The thing that was the hardest for me to understand was the definition of d.  The way it’s written, the distance between 2 adjacent sectors (for example d(2,3)) is 0.  The distance between a sector and itself (for example d(2,2)) is m-1.  The paper by Cody and Coffman does a better job of explaining the motivation: What we’re looking at is the time (in sectors traversed) for a disk to read sector j after finishing reading sector i.  So if we read sector 2 right before reading sector 3, the disk has no traversal time to go from the end of sector 2 to the beginning of sector 3.  But if we read sector 2 twice in a row, the disk reader (in this model) needs to scan to the end of all of the sectors, then return to the beginning, then scan all the way to the beginning of sector 2 to read again.

So, suppose we have m=2, and 4 records, each with .25 probability.  If we put them all in (say) R1, then p(R1) = 1, and p(R2) = 0.  Since d(1,1) = d(2,2) = 1 and d(1,2) = d(2,1) = 0, our sum is:

• p(R1)*p(R1)* d(1,1) = 1*1*1 = 1, plus
• p(R1) * p(R2) * d(1,2) = 1*0*0 = 0, plus
• p(R2) * p(R1)* d(2,1) = 0*1*0 = 0, plus
• p(R2)* p(R2) * d(2,2) = 0*0*1 = 0

..for a total of 1.

If we put 2 records in sector 1, and 2 records in sector 2, then p(R1) = p(R2) = .5.  So our sum is:

• p(R1)*p(R1)* d(1,1) = .5*.5*1 = .25, plus
• p(R1) * p(R2) * d(1,2) = .5*.5*0 = 0, plus
• p(R2) * p(R1) * d(2,1) = .5*.5*0 = 0, plus
• p(R2)* p(R2) * d(2,2) = .5*.5*1 = .25

..for a total of .5.
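The whole cost computation is short enough to check directly.  Here’s a sketch (the function names are mine) that reproduces both totals:

```python
def d(i, j, m):
    """Sectors traversed between reading sector i and then sector j."""
    return j - i - 1 if i < j else m - i + j - 1

def expected_cost(probs, m):
    """Sum over all ordered pairs of sectors of p(Ri) * p(Rj) * d(i,j).
    probs[i] is the total access probability of sector i+1."""
    return sum(probs[i] * probs[j] * d(i + 1, j + 1, m)
               for i in range(m) for j in range(m))

# Four records of probability .25 each, m = 2 sectors.
print(expected_cost([1.0, 0.0], m=2))  # all records on one sector -> 1.0
print(expected_cost([0.5, 0.5], m=2))  # two records per sector -> 0.5
```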

The reduction: Hopefully the example using m=2 helps to show why using Partition is a good choice.  So we start with a set S of elements.  We will turn each element of S into a value between 0 and 1 reflecting its proportion of the sum of all of the elements.  For example, if S={1,2,3,4,5}, then we would create a set R of values {1/15, 2/15, 3/15, 4/15, 5/15}.  These probabilities will all be between 0 and 1 and will sum to 1.

We will set m=2 and K = 1/2.  Notice that d(1,2) = d(2,1) = 0, so the only d values that count in our sum are d(1,1) and d(2,2) (which are both 1).  So by our formula we need p(R1) * p(R1) + p(R2) * p(R2) = .5.

Some algebra tells us that this means that p(R1)*p(R2) = .25, and we know that p(R1) + p(R2) = 1.  Solving that system of equations gets us p(R1) = p(R2) = .5.  In other words, we have an Expected Retrieval Cost solution for R exactly when we have a partition of S.
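To see the correspondence end to end, here’s a sketch on a hypothetical Partition instance (S = {1, 2, 3, 4} is my own example, chosen so that an equal-sum partition exists):

```python
def d(i, j, m):
    return j - i - 1 if i < j else m - i + j - 1

def expected_cost(probs, m):
    return sum(probs[i] * probs[j] * d(i + 1, j + 1, m)
               for i in range(m) for j in range(m))

S = [1, 2, 3, 4]
total = sum(S)
R = [s / total for s in S]          # probabilities summing to 1

# The equal-sum partition {1, 4} / {2, 3} gives p(R1) = p(R2) = .5,
# hitting K = 1/2 exactly.
p1 = R[0] + R[3]
p2 = R[1] + R[2]
print(expected_cost([p1, p2], m=2))          # exactly K = 0.5

# An unbalanced split (just record 1 on sector 1) costs more than K.
print(expected_cost([R[0], 1 - R[0]], m=2))
```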

Difficulty: 4. Cody and Coffman say the details of the above reduction are “routine” after defining K = 1/2.  It is pretty straightforward, but there are some tricky details to get right.

I will say though that the definition in G&J, where it’s not clear how distances to adjacent things can be 0, struck me as much harder, and is the reason I dug up the Cody and Coffman paper in the first place.  I’d say that definition makes the problem a 6 or 7.

## Pruned Trie Space Minimization

This problem is hard to explain, partially because the definition given by G&J doesn’t map easily onto the structure they are talking about.

The problem: Pruned Trie Space Minimization.  This is problem SR3 in the appendix.

The description in G&J: Given a finite set S, a collection F of functions mapping elements of S to positive integers, and a positive integer K.  Can we find a sequence of m distinct functions from F <f1 .. fm> such that:

• For each pair of elements a and b in S, there is some function fi in the sequence where fi(a) ≠ fi(b)
• For each i from 1 to m, if we define N(i) to be the number of distinct tuples X = (x1..xi) where more than one a in S has the tuple (f1(a), …, fi(a)) = X, then the sum of all of the N(i) values is at most K?
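G&J’s condition can be stated directly in code.  This sketch (the toy instance and the names are mine) computes the N(i) values for a candidate sequence of functions:

```python
from itertools import permutations

def n_i(S, fs, i):
    """N(i): the number of distinct tuples (f1(a), ..., fi(a)) that are
    shared by more than one element a of S."""
    tuples = [tuple(f(a) for f in fs[:i]) for a in S]
    return sum(1 for t in set(tuples) if tuples.count(t) > 1)

def total_cost(S, fs):
    """Sum of N(i) for i = 1..m; fs must eventually separate all of S."""
    assert len(set(tuple(f(a) for f in fs) for a in S)) == len(S)
    return sum(n_i(S, fs, i) for i in range(1, len(fs) + 1))

# Toy instance: S is a set of strings, and attribute k reads position k.
S = ["bane", "bank", "band"]
F = [lambda s, k=k: s[k] for k in range(4)]

print(total_cost(S, F))  # left-to-right order: shared prefixes cost 3
print(min(total_cost(S, list(p)) for p in permutations(F)))  # best order: 0
```

The best order here tests position 3 first, which separates all three strings immediately, so every N(i) is 0.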

A better description: G&J’s definition removes all knowledge of the “tries” from the problem.  I think the Comer and Sethi paper that is referred to in the appendix does a better job.

First, a trie is a tree that separates a sequence of strings by letters. The idea is that each string has a unique path through the tree.  Here is the tree used in the paper:

This trie shows the paths for the set of strings {back, bane, bank, bare, barn, band, bang, barb, bark, been}, built by considering the letters of each string from left to right.  By considering the letters in different orders, we will get differently shaped tries, with different numbers of internal nodes.

A pruned trie recognizes that a long path of nodes with 1 child each doesn’t actually need to be represented.  For example, once you go down the “b-e” side, the only place you can end up is at “been”.  So the trie is pruned by removing all such chains (we would consider the “e” node a leaf).

What we are interested in doing is finding an ordering on the letters in the string (or, more generally, the “attributes” of an element we are trying to distinguish) in order to minimize the number of nonleaf nodes in the pruned trie.

The actual question we want to solve is: Given a set of strings S and an integer K, can we construct a trie that differentiates the strings in S with K or fewer internal nodes?
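As a sanity check on the definitions, here’s a sketch (my own code) that builds the trie for a given attribute order and counts internal nodes after pruning.  The counting convention is mine: an internal node is one where the remaining strings actually disagree, and single-child chains are pruned away entirely.  Under that convention the left-to-right order on the paper’s word set gives 4 internal nodes, one fewer than the paper’s figure shows, so its drawing convention evidently differs slightly; either way, different attribute orders change the count, which is the point.

```python
def pruned_internal_nodes(strings, order):
    """Count branching nodes of the trie that examines string positions
    in `order`, with single-child chains pruned away."""
    if len(strings) <= 1:
        return 0
    if not order:
        raise ValueError("attribute order does not distinguish the strings")
    pos, rest = order[0], order[1:]
    groups = {}
    for s in strings:
        groups.setdefault(s[pos], []).append(s)
    subtotal = sum(pruned_internal_nodes(g, rest) for g in groups.values())
    # A node where every string takes the same branch is part of a chain
    # and gets pruned away.
    return subtotal + (1 if len(groups) > 1 else 0)

words = ["back", "bane", "bank", "bare", "barn",
         "band", "bang", "barb", "bark", "been"]
print(pruned_internal_nodes(words, order=[0, 1, 2, 3]))  # 4
```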

I think the way this maps to the G&J definition is:

S is the set of strings.  F is the set of attributes (each function maps a string to the value of one of its attributes).  The sequence of functions <f1, …, fm> is the order in which we choose attributes.  So f1(a) is the first node in the trie that we go to on the string a, f2(a) is the second node we go to, and so on.  The fi(a) ≠ fi(b) requirement says that we need to eventually differentiate each string from each other, and the N(i) number counts the number of internal nodes at each height of the tree:

Example: For the picture shown above, we get the following pruned trie (also from the paper):

This trie has 5 internal nodes.

Reduction: G&J say that the reduction goes from 3DM, but in the paper it goes from 3SAT.  So we’ll start with a formula in 3CNF form with n variables and m clauses.  The strings we’ll build will have 3n+3m attributes (you can think of this as strings of length 3n+3m).  The first 2n attributes will correspond to literals (one attribute for the positive setting of a variable, one attribute for the negative setting).  The next 3m attributes will correspond to clauses (3 attributes for the 3 possible positions a literal can appear in a clause), and the last n attributes correspond to the variables themselves (to combine the positive and negative literals of each variable).

We will have one string for each literal (a 1 in the attribute matching that literal, a 1 in each attribute matching a position where that literal appears in a clause, and a 1 in the attribute matching its variable, 0’s everywhere else).  We will have one string for each clause (a 1 in that clause’s three position attributes, 0’s everywhere else).  Then we will have a sequence of “hard to distinguish” strings made of decreasing numbers of 2’s (with 0’s everywhere else).

Here’s the example construction from the paper (blank spaces are zeros).  It’s a little confusing because they chose n=m=3, but you can see where the various pieces are:

K=2n+m.
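Here’s a sketch of the literal and clause strings under the layout described above.  The indexing and the stand-in formula are my own, and the “hard to distinguish” strings of 2’s are left out, since their exact layout comes from the paper’s table:

```python
def build_strings(n, clauses):
    """clauses: list of 3-tuples of literals, e.g. (1, -2, 3) for
    (x1 or not-x2 or x3).  Returns the literal and clause strings."""
    m = len(clauses)
    width = 3 * n + 3 * m      # 2n literal + 3m clause + n variable attrs

    def lit_attr(lit):         # attribute index of a literal
        return 2 * (abs(lit) - 1) + (0 if lit > 0 else 1)

    strings = []
    for lit in [l for v in range(1, n + 1) for l in (v, -v)]:
        s = [0] * width
        s[lit_attr(lit)] = 1                       # the literal attribute
        for c, clause in enumerate(clauses):
            for pos, l in enumerate(clause):
                if l == lit:                       # occurrences in clauses
                    s[2 * n + 3 * c + pos] = 1
        s[2 * n + 3 * m + abs(lit) - 1] = 1        # the variable attribute
        strings.append(s)
    for c, clause in enumerate(clauses):
        s = [0] * width
        for pos in range(3):                       # the clause's 3 positions
            s[2 * n + 3 * c + pos] = 1
        strings.append(s)
    return strings

# n = m = 3, as in the paper's example (the formula is my own stand-in):
strs = build_strings(3, [(1, -2, 3), (-1, 2, -3), (1, 2, 3)])
print(len(strs), len(strs[0]))   # 2n+m = 9 strings of length 3n+3m = 18
```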

If the formula is satisfiable, then the ordering of attributes where we put all of the literals that form the satisfying assignment first, then all of the clauses, then the W attributes (for the variables), distinguishes the strings in L with 2n+m internal nodes.

In fact, any trie must have at least K internal nodes to distinguish the strings in L.  That can be seen from the table, since we have K strings made up of decreasing numbers of 2’s, and we have to distinguish those strings in order (the string with the most 2’s first, then the ones with fewer 2’s, all the way down to the last one with just one 2).  We also need to choose one attribute for each variable (its positive or its negative literal).  Suppose we choose an attribute Ui (or its negation).  That node in the trie has 3 children:

• A 2, which distinguishes the string in L.
• A 1, which distinguishes the string corresponding to that literal in J.
• A 0, for everything else.

What this means is that we have “distinguished off” the literal string (in J) from the rest (on a 1), which means that the 1 it has in the clause position will not interfere with the 1 in that position of the clause string (in K).  So each clause string can be distinguished by the clause position that satisfies it.

So, if we have a trie with “only” K internal nodes, the attributes must line up to allow us to have a setting of a variable to satisfy each clause.

Difficulty: 8, with the Comer and Sethi trie definition.  If you are going straight from G&J’s definitions, it’s at least a 9.