Monthly Archives: March 2017

Longest Common Subsequence

I’m going out of order here: the next few problems are all related, this is the easiest one, and it’s a good way to get thinking about this kind of problem.

The problem: Longest Common Subsequence.  This is problem SR10 in the appendix.

The description: Given a finite set of strings R over some finite alphabet Σ, and a positive integer K, can I find a string w over the alphabet Σ such that w has length K or more, and w is a subsequence of each string in R?

Example: Suppose R is the set of strings {aab, aaab, aaaba, aaaa}

Then the longest w we could choose that is a subsequence of everything in R would be “aa”.  Notice that if we add “ba” to R, then the longest w we can make has length 1 (either the string “a” or the string “b”).
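
Just to make the example concrete, here’s a quick brute-force sketch in Python (the helper names like `is_subsequence` are my own, and it’s exponential, so it’s only good for toy inputs like this one).  It finds a longest common subsequence of a set of strings by trying subsequences of the shortest string, longest first:

```python
from itertools import combinations

def is_subsequence(w, s):
    """True if w can be obtained from s by deleting characters."""
    it = iter(s)
    return all(ch in it for ch in w)

def longest_common_subsequence_of_set(strings):
    """Brute force: any common subsequence is a subsequence of the
    shortest string, so try its subsequences from longest to shortest."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for picks in combinations(range(len(shortest)), length):
            candidate = "".join(shortest[i] for i in picks)
            if all(is_subsequence(candidate, s) for s in strings):
                return candidate
    return ""

print(longest_common_subsequence_of_set(["aab", "aaab", "aaaba", "aaaa"]))        # "aa"
print(longest_common_subsequence_of_set(["aab", "aaab", "aaaba", "aaaa", "ba"]))  # "a": only length 1 once "ba" is added
```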

Reduction: The paper by Maier uses VC.  (This is the “LCS” problem in the paper).  So we start with a graph G = (V,E)  and a K.  Our alphabet Σ has one character for each vertex in V.  R will consist of |E|+1 strings, based on a “template string” T consisting of the characters for all of the vertices in V, in some order. The first string in R is T itself.  Then, each edge (u,v) gets a string: Two copies of T concatenated.  The first copy has u removed, the second one has v removed.  Set K’ to |V|-K.
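
Here’s a small sketch of that construction in Python.  The function name `vc_to_lcs` and the toy instance at the bottom (the 5-vertex graph with the single edge (2,4) from the example further down, with an arbitrary K of 2) are mine, not from the paper:

```python
def vc_to_lcs(vertices, edges, k):
    """Build the LCS instance (set of strings R and bound K') from a
    Vertex Cover instance.  This sketch assumes each vertex name is a
    single distinct character, so vertices can double as the alphabet."""
    template = "".join(vertices)            # the template string T
    r = [template]                          # T itself is the first string in R
    for (u, v) in edges:
        first = template.replace(u, "")     # copy of T with u removed
        second = template.replace(v, "")    # copy of T with v removed
        r.append(first + second)
    return r, len(vertices) - k             # K' = |V| - K

# toy instance echoing the example below: T = "12345", one edge (2,4), K = 2
r, k_prime = vc_to_lcs("12345", [("2", "4")], 2)
print(r)        # ['12345', '13451235']
print(k_prime)  # 3
```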

Suppose G has a VC, call it V’.  Then take T and remove all of the vertices in V’ from it.  Call this remaining string T’; it has size |V|-K.  We will show that T’ is a subsequence of each string in R.  It’s pretty clearly a subsequence of T.

Each other string in R is built from some edge (u,v) in E, so either u or v is in V’.  If u is in V’, then u is missing from T’, and u is also missing from the first copy of T in that string, so T’ is a subsequence of that first copy.  Similarly, if v is in V’, then T’ is a subsequence of the second copy of T.

In the other direction, if we find a common subsequence T’ of size K’, then first notice that for each string in R made from an edge (u,v), T’ can’t contain both u and v.  The reason is that if (say) u comes before v in T, then in that string of R, u is missing from the first copy of T and v is missing from the second copy.  So the only occurrence of v comes before the only occurrence of u: they are “swapped”, and T’ (which keeps the order of T) can’t match both.

It’s much easier to see with an example.  Suppose we have 5 vertices, and T = 12345.  If we have the edge (2,4), then the string in R is 13451235.  T’ can’t contain 24 in that order, because the only 4 comes before the only 2 in that string.

So if we let V’ be the set of vertices whose characters are not in T’, we have K vertices, and by the argument above at least one endpoint of every edge (u,v) is missing from T’ and hence is in V’.  So V’ touches every edge in E, and forms a vertex cover of G.

Difficulty: 5.  I may have downgraded this difficulty after seeing the much harder SR8 and SR9, but I do think that it is pretty straightforward.

Capacity Assignment

This problem is from the same “unpublished manuscript” as last week’s.

The problem: Capacity Assignment.  This is problem SR7 in the appendix.

The description: Given a set C of “communication links” and a set M of positive capacities, where each pair of a link c and a capacity m has a cost g(c,m) and a delay penalty d(c,m) with the following properties:

  • If i, j ∈ M and i < j, then g(c,i) ≤ g(c,j)
  • If i, j ∈ M and i < j, then d(c,i) ≥ d(c,j)

We’re also given positive integers K and J.  The problem is: Can we assign a capacity to each link such that the total g cost of all of our assignments is ≤ K and the total d cost of all of our assignments is ≤ J?

Example: There’s a lot to parse in that problem description.  The first thing to notice is that the set of links C doesn’t necessarily have to link anything together (it’s not like it has to apply to an underlying graph).  So we can just give them names:

C={a,b,c,d,e}

Next, there is no reason why the set of capacities has to be assigned as a bijection to C: the set M could be an entirely different size than C:

M={1,2}

The cost function has to have the property that if we assign a 2 to a link, it has to cost at least as much as assigning a 1 to the link:

g(c,1) = 3 for all c

g(c,2) = 4 for all c

The delay function has to have the property that if we assign a higher capacity to a link, the delay can’t be larger than assigning a lower capacity:

d(c,1) = 6 for all c

d(c,2) = 5 for all c

In this case, if we assign the capacity of 1 to all links, we get a total cost of 15 and a total delay of 30.  If we assign the capacity of 2 to all links, we get a total cost of 20 and a total delay of 25.     If we have K = 18, and J = 27, we can achieve that by setting 2 links to have capacity 1 and 3 links to have capacity 2.
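
Here’s a quick sanity check of that arithmetic in Python (the dictionaries and the `totals` helper are just my own way of writing the example down):

```python
def totals(assignment, g, d):
    """Total cost and total delay of a capacity assignment.
    `assignment` maps each link to its chosen capacity."""
    cost = sum(g[(c, m)] for c, m in assignment.items())
    delay = sum(d[(c, m)] for c, m in assignment.items())
    return cost, delay

links = ["a", "b", "c", "d", "e"]
g = {(c, 1): 3 for c in links} | {(c, 2): 4 for c in links}
d = {(c, 1): 6 for c in links} | {(c, 2): 5 for c in links}

# two links at capacity 1, three at capacity 2
assignment = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
print(totals(assignment, g, d))  # (18, 27), which meets K = 18 and J = 27
```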

The reduction: The example above is pretty close to how the reduction will work.  We will reduce from Sum of Subsets, so we start with a set S of integers and a target B.  Our set C will have one element for each element in S.  Our set M will be {1,2}.  Assigning a capacity of 1 will imply we don’t want to take this element in S’, and assigning a capacity of 2 will imply that we do.  (This would make more sense if I could use the set {0,1} for M, but the problem description says the elements of M have to be positive.)

We will define our g function so that g(c,1) = 1 for all c, and g(c,2) will be s(c)+1 (where s(c) is the size of the element in S that corresponds to c).

Our d function will work similarly:  d(c,1) = s(c)+1 for all c, and d(c,2) = 1 for all c.  These functions both follow the restrictions for how g and d work.

Set K = |S| + B.  Each cost is either s(c)+1 (for the elements we assigned a 2) or 1 (for the elements we assigned a 1), so the total cost is |S| plus the sum of the sizes of the elements we “take”.  The bound K is saying that those sizes can add up to at most B.

Let T = the sum of the sizes of all of the elements in S, and let J = |S| + T – B.  Again, each d value always includes 1, and includes s(c) as well for the elements we assigned a 1.  So the total delay is |S| plus the sum of the sizes of the elements we did not take, and the bound J is saying those sizes can add up to at most T – B.  In other words, the sizes of the elements we did take must add up to at least B.
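
Putting the whole construction together, here’s a sketch of the reduction in Python.  The function name `sos_to_capacity_assignment` and the tiny sample instance are mine; the g, d, K, and J values follow the definitions above:

```python
def sos_to_capacity_assignment(sizes, b):
    """Build the Capacity Assignment instance from a Sum of Subsets
    instance (a list of element sizes and a target b).  Links are the
    element indices and the capacities are {1, 2}."""
    links = list(range(len(sizes)))
    g, d = {}, {}
    for c in links:
        g[(c, 1)] = 1               # capacity 1 ("don't take"): cheap cost...
        d[(c, 1)] = sizes[c] + 1    # ...but the delay charges the element's size
        g[(c, 2)] = sizes[c] + 1    # capacity 2 ("take"): cost charges the size...
        d[(c, 2)] = 1               # ...but the delay is cheap
    t = sum(sizes)
    k = len(sizes) + b              # cost bound
    j = len(sizes) + t - b          # delay bound
    return links, g, d, k, j

# tiny example: sizes {1, 2, 3} with target 3 (take {1, 2}, or just {3})
print(sos_to_capacity_assignment([1, 2, 3], 3))
```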

If S has a SOS solution S’, then assigning a capacity of 2 to all elements in S’ and a 1 to all elements not in S’ gives us a cost value of exactly K, and a delay value of exactly J.

If we have a Capacity Assignment solution, then notice that K+J = 2|S| + T, which is the sum of all costs and delays no matter what assignment is chosen (g(c,m) + d(c,m) = s(c)+2 for every c, no matter what m we use).  So if the sum of the costs were strictly less than K, the sum of the delays would have to be strictly more than J (and vice versa).  The only way to satisfy both bounds is to make the sums exactly K and J, and hitting K exactly means the elements assigned capacity 2 have sizes summing to exactly B, which gives us a SOS solution.

Difficulty: 4.  I think the algebra for this problem is a little easier than last week’s, but it does take some work to understand what the problem is asking.  Changing the problem slightly to allow capacities, costs, and delays to be 0, instead of requiring them all to be positive integers, would make the reduction easier too.

Multiple Copy File Allocation

These next two problems are from an “unpublished manuscript” that I can’t find.  Assuming I did them correctly, I think they are also good problems for students to work on.

The problem: Multiple Copy File Allocation.  This is problem SR6 in the appendix.

The description: Given a graph G=(V,E), where each vertex v ∈ V has a “usage” u(v) and a “storage cost” s(v) (both positive integers), and a positive integer K.  Can we find a subset V’ of V such that, defining d(v) for each vertex to be the length of the shortest path from v to some member of V’, the sum of:

  • The s(v) values for all vertices in V’ plus
  • The d(v)*u(v) values for all vertices not in V’ is ≤ K?

Example: The idea is that each vertex in V’ pays its s(v) cost, and each vertex not in V’ pays its u(v) cost for each edge it has to go through to get to something in V’.  Here’s a simple example:

[Figure: example graph (sr6)]

Each node lists its s value first and u value second.  If we take just the “1,10” node into V’, our total cost is:

  • 1 from the “1,10” node (its s value)
  • 2 + 3 + 4 from each of its neighbors (their u values * a distance of 1)
  • 2 from the “90,1” vertex.  (u cost of 1, times a distance of 2 edges from the vertex in V’)

..for a total of 12.  I think that’s minimal.
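
Here’s a little Python sketch that computes that cost with a multi-source BFS.  Since the picture isn’t reproduced here, the node labels and the storage costs of the non-chosen nodes (which don’t affect this particular total) are my own guesses:

```python
from collections import deque

def mcfa_cost(adj, s, u, chosen):
    """Cost of storing copies on the vertices in `chosen`: each chosen
    vertex pays s(v); every other vertex pays u(v) times its distance
    (in edges) to the nearest chosen vertex."""
    dist = {v: 0 for v in chosen}       # multi-source BFS from the chosen set
    queue = deque(chosen)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return sum(s[v] if v in chosen else dist[v] * u[v] for v in adj)

# hub "h" (s=1, u=10) with neighbors a, b, c, and a farther vertex f (s=90, u=1)
adj = {"h": ["a", "b", "c"], "a": ["h"], "b": ["h"], "c": ["h", "f"], "f": ["c"]}
s = {"h": 1, "a": 99, "b": 99, "c": 99, "f": 90}   # the neighbors' s values are made up
u = {"h": 10, "a": 2, "b": 3, "c": 4, "f": 1}
print(mcfa_cost(adj, s, u, {"h"}))  # 1 + 2 + 3 + 4 + 2 = 12
```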

Reduction: G&J say to use  Vertex Cover, and mention that we can make all s values equal to each other and all u values equal to each other.  The phrasing implies that we shouldn’t make the s value and u values the same altogether, which is a good hint.

So we start with our VC instance: a graph G=(V,E) and an integer K.  We will use the same graph.  Let s = the s(v) for every vertex = |V|.  Let u = the u(v) for every vertex = ⌈|V|/2⌉ + 1 (a little over half of |V|, so that 2u > s).

Set K’ = K*s + (|V|-K)*u.  Hopefully it’s clear how this number works: if we have a vertex cover of size K and take it as V’, each cover vertex pays s, and every other vertex is adjacent to the cover and pays u, so the cover becomes an MCFA solution of cost exactly K’.
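
Here’s the parameter-setting step as a tiny Python sketch (the function name `vc_to_mcfa` is mine, and I’m using the ⌈|V|/2⌉ + 1 value for u as above):

```python
import math

def vc_to_mcfa(num_vertices, k):
    """Uniform s and u values and the bound K' for the reduction above.
    Using u = ceil(|V|/2) + 1 keeps 2u > s, which the argument below needs."""
    s = num_vertices
    u = math.ceil(num_vertices / 2) + 1
    return s, u, k * s + (num_vertices - k) * u

print(vc_to_mcfa(5, 2))  # (5, 4, 22): s = 5, u = 4, K' = 2*5 + 3*4
```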

If we have an MCFA solution, first notice that any vertex that is 2 (or more) edges away from everything in V’ contributes at least 2*u to the total cost.  Since 2*u > s, we can add to V’ a vertex at distance 1 from such a far-away vertex and keep our total cost ≤ K’ (the new member of V’ now pays s instead of at least u, and the far-away vertex saves at least u, so the net change is at most s – 2u < 0).  This means we can transform any MCFA solution into one that is still within the bound and where V’ is a cover of G.

Also notice that, even after making these changes, if we were to have K+1 (or more) vertices in V’, our total cost would be at least (K+1)*s + (|V|-K-1)*u.  This is more than K’ (it adds an extra s of cost |V| and removes a u of cost only about |V|/2).

So we can construct an MCFA solution where V’ has K or fewer vertices and forms a cover.  This gives us a solution to the VC instance.

Difficulty: 5.  It takes a little playing around with the algebra to get the correct values, but the idea that we can make the s and u values the same for all vertices is helpful, and gives a clear path to how you’d reduce this problem from VC.

Rooted Tree Storage Assignment

This whole section is filled with problem descriptions that confuse me.  Here’s another one:

The problem: Rooted Tree Storage Assignment.  This is problem SR5 in the appendix.

G&J’s definition: Given a set X, and a collection C= {X1..Xn} of n subsets of X, and an integer K.  Can we find a collection C’ = {X1‘..Xn‘} of n subsets of X, where:

  • Each Xi‘ contains all of the elements of its corresponding Xi, plus possibly some extra elements.
  • There are at most K new elements added across all of the Xi‘
  • We can treat the elements of X as vertices and add directed edges to form a directed tree where the elements of each Xi form a directed path.

Example: Suppose X = {1,2,3,4,5}, and X1 = {1,2,3,4}, X2 = {1,3,4}, X3 = {1,3,5}, X4 = {2,5}

Then we can set each Xi‘ equal to Xi, except for X3‘, where we’ll add the element 2.  This gives us the arrangement:

[Figure: the resulting directed tree (rooted-tree-storage-assignment)]

 

The way to think of the Xi is as “required elements along a path from the root”, and the Xi‘ as “the actual elements along the path from the root”.

But the paper from Gavril that has the reduction gives what I think is the better definition:

Gavril’s Definition: (He calls this problem “Augmented directed tree arrangement”.)  Given a bipartite graph B, whose vertices can be split into two sets X and Y, and an integer K, can we add K or fewer edges to B to form a new bipartite graph such that:

  • The elements in X can be arranged as a directed tree
  • For each subset S of Y (I think; it’s not entirely clear whether this is a subset or just a single vertex), the vertices in X that are adjacent to S can be arranged to form a directed path in the directed tree of X.

I like this better because you can see that the vertices that need to be in order in the tree are “pointed to” by edges from Y.  The extra elements that you get in Xi‘ correspond to adding more edges between X and Y.

Example: The example above would be represented by the following bipartite graph:

[Figure: the bipartite graph for the example (rooted-tree-storage-assignment2)]

Notice how the vertices from X are almost arranged to form the needed directed paths with no gaps, but we need to add an edge from X2 (the vertex for element 2) to Y3 (the vertex for the set X3) to fix the arrangement.

Reduction: The Gavril paper uses Rooted Tree Arrangement.  So we’re given a graph G=(V,E) and an integer K.  Our bipartite graph will have X=V, and Y=E, and each vertex in Y will connect to the 2 vertices in X that correspond to the endpoints of the edge in G. K’ = K-|E|.

If you’re worried, like I was, that this might possibly make K’ negative, keep in mind that the K for the Rooted Tree Arrangement problem was the sum of the path lengths between pairs of vertices in G that are connected by an edge.  So for K to be meaningful it has to be ≥ |E|.
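
Here’s a sketch of the construction in Python.  The function name `rta_to_storage_assignment` and the triangle example at the bottom are mine; the bipartite graph and K’ follow Gavril’s construction as described above:

```python
def rta_to_storage_assignment(vertices, edges, k):
    """Build the bipartite graph B and bound K' from a Rooted Tree
    Arrangement instance: X is the vertex set, Y has one vertex per edge
    of G, and each Y-vertex is joined to the two endpoints of its edge."""
    x_side = list(vertices)
    y_side = list(range(len(edges)))          # one Y-vertex per edge of G
    b_edges = []
    for y, (u, v) in enumerate(edges):
        b_edges.append((u, y))
        b_edges.append((v, y))
    return x_side, y_side, b_edges, k - len(edges)   # K' = K - |E|

# toy instance: a triangle, with an (arbitrary) arrangement bound K = 4
print(rta_to_storage_assignment(["a", "b", "c"],
                                [("a", "b"), ("b", "c"), ("a", "c")], 4))
```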

If we have a solution for the Rooted Tree Arrangement problem, then we have a path between each pair of vertices connected by an edge in E.  So, for each vertex in Y, look at the corresponding path in the solution.  If the path goes through additional vertices besides the endpoints of the edge, add an edge to our bipartite graph from the vertex in Y to each vertex passed through in X.  In doing so, we will add at most K-|E| edges to the bipartite graph, and the tree from the arrangement serves as the directed tree for the Storage Assignment solution.

If we have a solution to the Storage Assignment problem, the edges that were added to the bipartite graph mark the vertices traveled along a path that solves the Rooted Tree Arrangement problem, and so tells us how to build a satisfying arrangement.

Difficulty: 6.  This isn’t the hardest reduction to understand, but it does require you to know the Rooted Tree Arrangement problem really well, and to understand just what the job of the bipartite graph is.

Expected Retrieval Cost

Here’s another problem where the G&J definition confused me for a bit.

The problem: Expected Retrieval Cost.  This is problem SR4 in the appendix.

The description: Given a set R of records, each with a probability of being accessed between 0 and 1 (with all probabilities summing to 1), some number m of sectors to place records on, and a positive bound K.  Can we partition R into m disjoint subsets R1..Rm such that, defining:

  • The “latency” cost of 2 sectors i and j, called d(i,j), as j-i-1 (if i < j) or m-i+j-1 (if i ≥ j), and
  • The probability of a sector, called p(Ri), as the sum of the probabilities of the records on that sector,

..the sum over all ordered pairs of sectors i and j of p(Ri) * p(Rj) * d(i,j) is K or less?

Example: The thing that was hardest for me to understand was the definition of d.  The way it’s written, the distance between 2 adjacent sectors (for example d(2,3)) is 0.  The distance between a sector and itself (for example d(2,2)) is m-1.  The paper by Cody and Coffman does a better job of explaining the motivation: what we’re looking at is the time (in sectors traversed) for a disk to start reading sector j after finishing reading sector i.  So if we read sector 2 right before reading sector 3, the disk has no traversal time to go from the end of sector 2 to the beginning of sector 3.  But if we read sector 2 twice in a row, the disk reader (in this model) needs to scan to the end of all of the sectors, then return to the beginning, then scan all the way to the beginning of sector 2 to read it again.
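
Here’s the latency function written out as a tiny Python helper, just to make the wrap-around behavior concrete (the sample calls use m = 4):

```python
def d(i, j, m):
    """Sectors traversed between finishing sector i and starting sector j,
    out of m sectors arranged in a circle."""
    return j - i - 1 if i < j else m - i + j - 1

m = 4
print(d(2, 3, m))  # 0: sector 3 starts right where sector 2 ends
print(d(2, 2, m))  # 3: sweep past sectors 3, 4, and 1 to get back to sector 2
```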

So, suppose we have m=2, and 4 records, each with .25 probability.  With m=2, d(1,1) = d(2,2) = 1 and d(1,2) = d(2,1) = 0.  If we put all of the records in the same sector, say R1, then p(R1) = 1 and p(R2) = 0.  So our sum is:

  • p(R1)*p(R1)* d(1,1) = 1*1*1 = 1, plus
  • p(R1) * p(R2) * d(1,2) = 1*0*0 = 0, plus
  • p(R2) * p(R1)* d(2,1) = 0*1*0 = 0, plus
  • p(R2)* p(R2) * d(2,2) = 0*0*1 = 0

..for a total of 1.

If we put 2 records in sector 1, and 2 records in sector 2, then p(R1) = p(R2) = .5.  So our sum is:

  • p(R1)*p(R1)* d(1,1) = .5*.5*1 = .25, plus
  • p(R1) * p(R2) * d(1,2) = .5*.5*0 = 0, plus
  • p(R2) * p(R1)* d(2,1) = .5*.5*0 = 0, plus
  • p(R2)* p(R2) * d(2,2) = .5*.5*1 = .25

..for a total of .5.
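
Here’s a short Python sketch that computes this sum for a given partition of the records into sectors and reproduces both totals above (the helper names are mine):

```python
def d(i, j, m):
    """Latency between sectors i and j, as defined above."""
    return j - i - 1 if i < j else m - i + j - 1

def expected_retrieval_cost(sectors):
    """Sum of p(Ri) * p(Rj) * d(i, j) over all ordered pairs of sectors.
    `sectors` is a list (one entry per sector) of lists of record probabilities."""
    m = len(sectors)
    p = [sum(sector) for sector in sectors]
    return sum(p[i - 1] * p[j - 1] * d(i, j, m)
               for i in range(1, m + 1)
               for j in range(1, m + 1))

print(expected_retrieval_cost([[0.25, 0.25, 0.25, 0.25], []]))  # 1.0
print(expected_retrieval_cost([[0.25, 0.25], [0.25, 0.25]]))    # 0.5
```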

The reduction: Hopefully the example using m=2 helps to show why using Partition is a good choice.  So we start with a set S of elements.  We will turn each element of S into a value between 0 and 1 reflecting its proportion of the sum of all of the elements.  For example, if S={1,2,3,4,5}, then we would create a set R of values {1/15, 2/15, 3/15, 4/15, 5/15}.  These probabilities are all between 0 and 1 and sum to 1.

We will set m=2 and K = 1/2.  Notice that d(1,2) = d(2,1) = 0, so the only d values that count in our sum are d(1,1) and d(2,2) (which are both 1).  So by our formula we need p(R1)*p(R1) + p(R2)*p(R2) ≤ 1/2.

Since p(R1) + p(R2) = 1, squaring gives p(R1)*p(R1) + p(R2)*p(R2) = 1 – 2*p(R1)*p(R2), so our constraint is the same as asking for p(R1)*p(R2) ≥ 1/4.  But two non-negative numbers that sum to 1 have a product of at most 1/4, so the only way to satisfy the constraint is p(R1)*p(R2) = 1/4, which forces p(R1) = p(R2) = 1/2.  In other words, we have an Expected Retrieval Cost solution for R exactly when we have a partition of S.
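
And here’s the reduction itself as a small Python sketch (the function name `partition_to_erc` is mine; I’m using Fractions so the 1/15-style probabilities stay exact):

```python
from fractions import Fraction

def partition_to_erc(sizes):
    """Turn a Partition instance (a list of element sizes) into an Expected
    Retrieval Cost instance: normalize the sizes into probabilities that sum
    to 1, use m = 2 sectors, and set K = 1/2."""
    total = sum(sizes)
    probabilities = [Fraction(x, total) for x in sizes]
    return probabilities, 2, Fraction(1, 2)

print(partition_to_erc([1, 2, 3, 4, 5]))
# probabilities 1/15, 2/15, 3/15, 4/15, 5/15 (in lowest terms), m = 2, K = 1/2
```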

Difficulty: 4. Cody and Coffman say the details of the above reduction are “routine” after defining K = 1/2.  It is pretty straightforward, but there are some tricky parts to worry about.

I will say though that the definition in G&J, where it’s not clear how distances to adjacent things can be 0, struck me as much harder, and is the reason I dug up the Cody and Coffman paper in the first place.  I’d say that definition makes the problem a 6 or 7.