Consecutive Ones Matrix Augmentation

While I had the dissertation I needed for the last problem, and before I had to send it back (without making copies), I skipped ahead to another problem that used it as a reference.

The problem: Consecutive Ones Matrix Augmentation.  This is problem SR16 in the appendix.

The description: Given an mxn matrix of 0’s and 1’s and a positive integer K, can we change K or fewer 0’s into 1’s to give the matrix the consecutive ones property.  See the Consecutive Ones Submatrix problem for more detail on this property.

Example: Here is the matrix that did not have the property from last week:

0 1 1 1 1
1 1 1 0 1
0 0 1 1 0
0 1 1 0 0
1 1 0 0 0

Recall that we’re looking for a way to  permute the columns so that there are no gaps in the 1’s in each row.  The easy solution here is to add a 1 in the second row (making it all 1’s).  In that case, the matrix has the consecutive ones property without needing any column permutations.

Reduction: Again, it is found in the dissertation by Booth, but this time uses Optimal Linear Arrangement.  Recall in that problem we were given a graph and needed to find an ordering of vertices to minimize the “score” of the differences of the ordering values.

The corresponding idea here is to start with the graph and build an incidence matrix, just like in the Consecutive Ones Submatrix problem.  We notice that any permutation of the rows (in the dissertation, you permute the rows to find consecutive ones in each column) corresponds to a permutation of the vertices in the OLA problem.  If we took a permutation of the vertices, each edge (column) of the incidence matrix has exactly two 1’s in it.  There may or may not be any 0’s between those ones, but if there are, we can convert those 0’s into 1’s (this is the augmentation).

The way these problems relate is that if an edge exists between two vertices, vi and vj, located at positions i and j in the final permutation, that edge adds j-i to the cost (assuming j > i).  In our matrix, if we permute the row corresponding to vertex vi to row i, and permute the row corresponding to vertex vj to row j, then there are j-i-1 o’s between the two rows.  So we could set them all to 1, and that column would have the consecutive ones property.

Thus, a solution of cost K to the OLA problem corresponds to a solution to the Consecutive Ones Matrix Augmentation problem of cost K-|E|, since there are |E| columns in the incidence matrix, and each one differs by 1 from the OLA cost of the edge.

Difficulty: 6, mainly because of the incidence matrix/adjacency matrix difficulty I talked about last time, but also because OLA is not easy to get.  If you’ve already shown them incidence matrices (for example, because you’ve gone over the Consecutive Ones Submatrix problem), then this is probably a 4.

Consecutive Ones Submatrix

This is another problem where the reference is a PhD thesis that they had to ship to the library.  Hopefully I don’t mess the description up, because it’ll be hard to get back.

The problem: Consecutive Ones Submatrix.  This is problem SR14 in the appendix.

The description: Given an mxn matrix A, consisting of only 0’s and 1’s, and a positive integer K, can we find an mxK submatrix B of A such that we can rearrange the columns of B so that in each row, all of the 1’s occur consecutively?  (This is called the “consecutive ones property” and will come up in several problems.)

Example: Suppose this was A:

1 0 0 0
0 1 0 1
0 1 1 1
0 0 1 0

If K=3, we could among other things, choose the last 3 columns of A for B, and rearrange the second and third columns to get:

0 0 0
1 1 0
1 1 1
0 0 1

..which has the property.

The dissertation by Booth which has the reduction gives this example of a matrix that has no permutation when K=N.  (Note in the dissertation, the consecutive ones property applies to columns, not rows, so I transposed the matrix in the paper)

0 1 1 1 1
1 1 1 0 1
0 0 1 1 0
0 1 1 0 0
1 1 0 0 0

The easiest way to see why this won’t work is to realize that the fourth column needs to go last (or first) and the first column needs to go first (or last) to keep the sequences of four 1’s from being split.  So our partial solution is:

0 1 1 1 1
1 1 1 1 0
0 x x x 1
0 y y y 0
1 z z z 0

The x’s are some permutation of “010”, which means the 1 has to go last, so that puts column 3 from the original matrix there:

0 1 1 1 1
1 1 1 1 0
0 0 0 1 1
0 y y 1 0
1 z z 0 0

The y’s need to be a permutation of 01 that has a 1 in the third column, so that means that the second column needs to go into the third spot.  But the z’s need a permutation of 01 that has a 1 in the second column, so that means that the second column needs to stay in the second spot.  So no permutation works.

(While I have the dissertation, he also claims that this matrix can’t be permuted either:

0 0 1 1 0
0 0 0 1 1
0 0 0 0 1
1 1 0 0 0
1 1 1 0 0

..but it seems like this would work:

0 0 1 1 0
0 0 0 1 1
0 0 0 0 1
1 1 0 0 0
1 1 1 0 0

Maybe I’m doing the transposition wrong.  Remember, in his problem definition he permutes rows to keep consecutive ones in the columns, so he actually starts with:

1 0 0 0 1
1 1 0 0 0
1 1 0 0 0
0 0 1 1 0
0 0 0 1 1

..but I can permute rows to get:

1 1 0 0 0
1 1 0 0 0
1 0 0 0 1
0 0 0 1 1
0 0 1 1 0

I bet there is a missing 1 someplace in that matrix.)

Reduction: Is from Hamiltonian Path.  Keeping in mind that Booth’s thesis defines the problem as permuting rows to have 1’s in consecutive columns, he defines the incidence matrix of a graph G=(V,E) to be a |V|x|E| matrix M where mij is 1 if vertex i is in edge j, and 0 otherwise.  Notice that each column has exactly two 1’s in it.

(To solve G&J’s version of the problem, you’d want M=|E|x|V| and mij = 1 iff vertex j is in edge i)

Now all we have to do is take a graph, build an incidence matrix, and see if we can rearrange the rows so that the consecutive 1’s property holds for n-1 of the rows.  (The row we leave out is for the connection from the last vertex in the path back to the first vertex in the path, which we don’t have to take).

Difficulty: 5.  I’ll admit to being a bit confused by trying to use an adjacency matrix instead of an incidence matrix.  And to not really understanding the problem definition (the examples in the thesis really helped).  But really, I should have gotten this without having to make them ship the thesis all the way from California.

Sparse Matrix Compression

Here’s the problem I’ve been dreading for months.  I haven’t been able to track down the reference, and I haven’t been able to come up with a full reduction on my own.  I’ve decided that I fundamentally don’t understand the way G&J are explaining the problem and have come up with an alternate problem statement that I think meets the same goals.

The problem: Sparse Matrix Compression.  This is problem SR13 in the appendix.

G&J’s description: Given an mxn matrix A where each entry in the matrix is either 0 or 1, and an integer K ≤ m*n.  Can we find a sequence B of n+K integers, each of which is between 0 and m (inclusive), and a function s mapping the numbers from 1 to m to the numbers 1 to K, such that aij = 1 if and only if bs(i)+j-1 = i?

Example: So what does this have to do with compression, you may be asking? The idea is that the b sequence will tell us how to find runs of 1’s in various rows, and the s function tells us where each row’s sequence begins.

So, suppose this is our matrix:

1 0 1 0
0 0 1 1
0 0 0 1
1 1 0 0

The first row implies that the B sequence needs to have 1x1x in it (the x’s can be anything that’s not a 1)

The second row implies that the B sequence needs xx22 in it.

The third row implies that the B sequence needs xxx3

The fourth row implies that the B sequence needs 44xx

The shortest way I can make all of that happen is 4413122X.  In this case, s(1) = 2,  (the sequence starting at position 2: 4131 is the sequence for the first row), s(2) = 4, s(3)=1, s(4) = 1.  The X at the end can be any number, but needs to be there because the sequence needs to be of size n+K, and  the s values need to map from 1-K, and we have s(2) = 4.

(Non-)Reduction: G&J’s reference is to an “unpublished manuscript”, and they say to use graph 3- coloring, and to set K=3 (presumably related to the number of colors).  I’ve been thinking about this for months, and I think I just misunderstand the problem.   I have a real problem trying to understand the use of the K values- setting K=3 means that each row can only map to 1,2, or 3 via the s function.  And if two (or more!) rows have the same s value that means they can have no 1’s in common.  So since there are only 3 s values, if the graph is colorable, that means every row in the matrix needs to map into 3 groups of rows, where for each row they have no 1’s in common.

Which seems like the start of a good idea (the three groups are the three colors of vertices, two rows go into the same group if they can be colored the same and don’t share an edge), except that then you have to come up with a way to eventually get a number in the sequence for each 1 in each row and you need to worry about the fact that the groups that map to the other colors are starting right next to where your group starts (since every s(i) has to map to 1,2,or 3, they all have to start at one of the first 3 positions in the sequence, so for example b4 is the 4th thing in sequence 1, the 3rd thing in sequence 2, and the third thing in sequence 3.  So there needs to be at most one 1 out of all of those overlapping rows), and I couldn’t make this all work out.

Instead, I started thinking about how this problem looks an awful lot like Shortest Common Superstring– We need a common string that holds all of the strings corresponding to each row.

So, let me rewrite the problem as follows: Can we come up with a sequence B of 0’s and 1’s, and a function s such that bs(i)+j-1 = aij? Using this formulation, the best solution I could come up with for my example above was 1100011010.  s(1) = 7, s(2) = 4, s(3) = 3, s(4) = 1.  But on more sparse (or more similar) matrices, the string can get much shorter: If the matrix was 4 copies of the same row 0011, then 0011 can be the result string and s(i) = 1 for all i.  The 4×4 identity matrix can have the b sequence 0001000 and s(1) = 4, s(2) = 3, s(3)=2, s(1) = 1

This is “equivalent” to G&J’s description in the sense that it captures the same desires of the original problem- take a two-dimensional matrix that consists of 0’s and 1’s, and generate a (preferably short) one-dimensional representation with an index function that tells you where each row starts.  In this new formulation, the s function points to the start of each string, and now aij = 1 if and only if the corresponding element in the sequence is 1 (instead of i).  What isn’t so obvious to make equivalent is the K in the original problem (the number of “starting positions” you need to add to N to make the sequence) and the K here (the size of the resulting sequence).  I think it has to do with the structure of the underlying pattern of 0’s and 1’s in each row.  If there is a way to make those K’s translate in polynomial time, then this formulation is truly equivalent to G&J’s.

So now what we need is an instance of SCS where all strings are in the alphabet {0,1} (the problem is still NP-Complete in this case), make them be the rows of our matrix, and set the K’ of the Sparse Matrix Compression problem to be the K of the SCS problem, and it should work out.  I’m not 100% convinced this works, but it seems like it should.

Difficulty: 6.  Mainly because it’s so hard for me to understand the problem, but also seeing how it maps to SCS is not trivial.  But the problem as written by G&J is still unsolved.  Hopefully someone out there can explain how to solve that problem.

Hitting String

This is a good easy problem hidden in the middle of all of the hard ones.

The problem: Hitting String.  This is problem SR12 in the appendix.

The description: Given a set of strings A over the alphabet {0,1,*} all of the same length n, can we find a string x over the alphabet {0,1}, also of length n, where x agrees in at least one position with each string in A?

Example: Let A = {00000,11111,0*001, ***10, 101**, 1****}

Then x can be 10100.  It agrees with 00000 in the last position, 11111 in the first position, 0*001 in the fourth position, ***10 in the last position, and 1***** in the first position.

A pretty easy example of an instance that can’t be solved is A = {1**, 0**}, or even A = {***}

Reduction: We’ll go from 3SAT.  Each clause in the 3SAT instance will become a string in A.  Each string in A will have one position for each variable in the formula.  Each string will have a 1 in each variable’s position if that variable occurs positively in that clause, a 0 if it occurs negatively, and a * for all variables that don’t appear in the clause.  So each string will have just 3 characters that are not *.

Thus, we need to come up with a string x that has a 1 or 0 in each position (corresponding to a setting of a variable) that matches one of the three 1 or 0 characters in each string (satisfying that clause).

Difficulty: 4, maybe 3.  This is about as straightforward a “Turn SAT into a different kind of problem” reduction as you’re likely to see, but I do think crossing genre problems are harder for students than we may anticipate them to be.

Bounded Post Correspondence Problem

I’m sure most people remember the “Post Correspondence Problem” from Theory of Computation classes.  It’s mainly famous for two reasons:

  • It’s the classic example of an undecidable problem that isn’t about a language (like the Halting Problem is)
  • It leads to lots of fun (meaning “bad”) PCP jokes.  (“I need some PCP to be able to handle thinking about the PCP”)

Anyway, here is a restricted version of the problem that at least is decidable, if intractable.

The problem: Bounded Post Correspondence Problem.  This is problem SR11 in the appendix.

The description: Given two sequences of strings A =(a_1, ...a_n) and B = (b_1, ..., b_n) from a finite alphabet, and a positive integer K, can I find a sequence of integers i1..ik, where k ≤ K, and the string A’= a_{i_{1}} a_{i_{2}} ..a_{i_{k}} is identical to the string B’ =b_{i_1}b_{i_2}..b_{i_k}?

Example: The difference between this and “classic” PCP is the introduction of K, which makes the problem decidable because if nothing else you can just try all combinations of strings from A and B that are length K or less, which is finite.

So, any PCP instance that “works” is a good example to use here.  Here’s one that I got from my old Hopcroft and Ullman theory of computation book:

A =  {1, 10111,10}  B = {111,10,0}

We can construct the string 101111110.  From A, it’s built from 10111/1/1/10.  From B, it’s built from 10/111/111/0

Notice that we need to use the same elements from each string in the sequence (The second one, then the first one twice, then the third one).

Reduction: The reference in G&J is to a technical report by Constable, Hunt, and Sahni and says they use a “generic transformation”.   In the paper, that means they show it’s NP-Complete by showing it’s a “negation of freedom for a 2 variable loop free scheme”  Which is, as near as I can tell, a way of defining programs to be equivalent.

I’ll admit that I don’t really understand it.  I really want there to be a simpler reduction out there.  I found this post on stackexchange,  but it reduces from Shortest Common Supersequence, and uses a |R| of 2, and G&J say that Shortest Common Supersequence is polynomial with |R|=2, so something has to be wrong there.

I was tinkering with a reduction from Directed Hamiltonian Cycle, where the strings in A corresponded to edges, and the strings in B correspond to vertices.  So, for a graph like this:

You would get strings like this:

A B
x xa
ab bb
bc cc
ca a
cd dd

The idea is that each string in A corresponds to a directed edge, and each string in B corresponds to 2 copies of the destination of that edge.  The “x” is there to give us a starting point- we arbitrarily choose some starting vertex to be the start.  All edges that have this starting vertex as its destination only have 1 copy (instead of 2) of that vertex in B.

The problem is that a string like xabbcca  (corresponding to a non-Hamiltonian cycle) is a solution to the PCP problem, but is not a solution to the HC problem.  I don’t know how to force the string to be long enough to visit all vertices- I thought about working on a variant of the problem where you force the size of the string you create to be exactly =K instead of ≤ K.  But  since a cycle that doesn’t involve the start vertex can be “pumped” several times to make a string longer, that won’t work either.  There is probably something clever you can do with the strings to force you to visit all vertices, but I can’t see it.

Difficulty: 10, sadly.  I’m sure there is something easier out there.

Shortest Common Superstring

I’ve been doing this for long enough now that I start seeing the same paper come up multiple times in multiple situations.  When I looked up the reference for this problem, I remembered that I first got this paper as a reference for the Vertex Cover on Cubic Graphs problem, and now we’re doing the problem that they actually needed that problem for.  It was kind of cool to go into my file cabinet and say “Oh yeah, I remember this paper”

(In my VC3 post, I noted that this paper was currently my “Most Obscure Reference”.  That title has been supplanted by the Rosenthal PhD thesis I needed for the Network Survivability problem, which they sent me a hard copy of and wouldn’t let me duplicate  it or remove it from the library.   That might never be passed, though there is a problem coming up (Sparse Matrix Compression) that I still haven’t found a reference for.)

The problem: Shortest Common Substring.  This is problem SR9 in the appendix.

The description: Given a finite set R of strings from some finite alphabet, and a positive integer K, can we find a string w of length K or less such that each string in R is a substring of W?

Example: Notice how similar this is to last week’s Shortest Common Supersequence problem.  The difference is that subersequences can have gaps in the matching, but superstrings can’t.  So if we use the strings from last week: {aab, aaab, aaaba, aaaa}, our w needs to have each of these as a substring (in consecuitive positions in w).  The string “aaaaba” is the shortest I could find.  If we add “bb” to R, I think the only way to do that is to add the whole string bb either at the start or the end, since you need both bb and aaba consecutively in w.

Reduction: The paper by Maier and Storer has the reduction.  This is why they needed to prove Vertex Cover on cubic graphs was NP-Complete in the first place.  They show that this problem is true even in the case where all strings in R are the same length H, as long as H is at least 8.  For our purposes, we can get away with setting H=8, and moving on.

So, suppose we’re given a graph G=(V,E), where all vertices in G have degree 3, and an integer K.  For each vertex, we’ll label it’s 3 edges with a number between 0 and 2.  For vertex v and edge e, call this anumber g(a,e).  They don’t enforce a rule that if an edge e=(a,b) that g(a,e) = g(b,e) .

Each vertex a will generate 5 characters in the alphabet: a, a’, a0, a1, and a2. There is also a character $ not used anywhere else.  Each vertex generates 7 strings:  (Recall that H=8 for us, but I’m leaving it as H to make it consistent with the paper.

  1. $H-2aa0
  2. a0$H-2a1
  3. a1a’$H-4aa1
  4. a1$H-2a2
  5. a2a’$H-4aa2
  6. a2$H-2a0
  7. a0a’$H-2

Note that each of those strings is length H

We also define two components for each edge e=(a,b):

  • Ea = aag(a,e)a(g(a,e)+1)%3a’
  • Eb = bbg(b,e)b(g(b,e)+1)%3b’

..both of size 4.  From that we build a string of length H we add to R:

Eab=Ea$H-8Eb and

Eba = Eb$H-8Ea

..both of length H.  R has 7*|V|+2*|E| strings in it.  Set K’ = |R| + K – 11|E|.

Notice that the 7 vertex strings all overlap (the end character of one is the starting character of another).  The two strings for each edge also overlap.  So a “standard” superstring that contains all strings as a substring with these overlapping characters is of length |R| – 6|V| – 4 |E|.  Since |V| = 2/3 |E| (because G is cubic), that is length |R| – 8 |E|.

If we are given a VC of G, then suppose some vertex v is in the cover.  The edge (v,w) is connected with v.  Then either string 1,3, or 5 from the vertex list overlaps with the beginning of the string corresponding with (v,w), and either string 3,5, or 7 can overlap with the end of the edge string corresponding with (v,w).  The paper shows how making a series of these improvements, if we have a vertex cover, we can take the standard substring and bring the length down to K’.

In the other direction, if we have a minimal length superstring, we need to use it to build a cover of G.  The paper does this by showing that to be of this length, there need to be overlaps of the kind described above, and from those overlaps, we get our cover.

Difficulty: 8.  I wish there was an easy reduction to this from Shortest Common Supersequence, but I can’t find it.  I also wish that I could think of a better notation than g(a,e) because having them as subscripts and having that “mean” a character like a0-a2 is hard to follow.

Shortest Common Supersequence

Now to go back to the problems I skipped over.  This one uses a more complicated version of the “template string” used in the Longest Common Subsequence reduction.

I also think this reduction has a small mistake in the paper, though it’s possible I’m misunderstanding something.  Anyone who’s interested is encouraged to check the paper, and my work to see what’s going on.

The problem: Shortest Common Supersequence.  This is problem SR8 in the appendix.

The description: Given a finite set of strings R over some finite alphabet Σ, and a positive integer K. can I find a string w over the alphabet Σ of size K or less such that each string in R is a subsequence of w?

Example: Here’s the R from last time: {aab, aaab, aaaba, aaaa}

If we choose w to be aaaba, each string in R is a subsequence of w.  (Remember subsequence is different from substring in that subsequences are allowed to have gaps.  So for example aaaa is a subsequence of aaaba.).  If we add the string “bb” to R, then  w needs to have an extra b someplace (some candidates for w are aaabba, aaabab, baaaba, and so on).

Reduction: The same paper by Maier as last week has this reduction too (This is the “SCS” problem in the paper).  The reduction is again from VC, so we’re given a graph G=(V,E) and an integer K.   The alphabet ∑ will have one symbol for each vertex in V, one symbol for each edge in E, and a special * symbol.  Define c = max(|V|,|E|).  Also note that the vertices and edges are “ordered” somehow (it doesn’t matter the exact ordering) so that we can have an order of the vertex and edge characters in ∑.

Just as in the LCS problem, the strings in R will be based on a template string.  The template string T consists of (in order):

  • All of the alphabet symbols corresponding to the vertices in V, in order
  • A section of 4c stars
  • All of the alphabet symbols corresponding to the edges in E, in order, twice in a row (so 2 copies of the first edge symbol, then 2 copies of the second edge symbol, and so on)
  • All of the alphabet symbols corresponding to the edges in E, in order, twice in a row (again)
  • Another section of 4c stars
  • All of the alphabet symbols corresponding to the vertices in V, in order (again)

This template string is length 8c+4|E|+2|V|

T will be in R, and we will have one string in R for each edge in E.  For an edge e=(vi, vj), we create a string with (in order):

  1. Two copies of the alphabet symbol for e
  2. The alphabet symbol for vi
  3. A section of 4c stars
  4. The alphabet symbol for vj
  5. Two copies of the alphabet symbol for e (again)

Set K’ = 8c+6|E| + 2|V| + K

If G has a VC V’ of size K, then each edge has at least one vertex in V’.  Let W be the set of edges whose “first” vertex occurs in the cover (by “first”, we mean the vertex that comes alphabetically first in the ordering we used to make ∑), and let U be the set of all remaining edges (and thus for all edges in U, the “second” vertex of the edge is in V’).

Take our template string T and add:

  • The alphabet symbol for each edge in W, twice, at the beginning of T
  • The alphabet symbols for each vertex in V’ in between the two copies of the edge lists in the middle (between the third and fourth bullets in the above definition of T)
  • The alphabet symbol for each edge in U, twice, at the end of T.

This adds 2|E|+k to T’s size  (each edge is either in W or U, and will appear twice), and so this new T’ is the right size for a SCS solution.

Now we need to show that each string in R is a subsequence of this new T’.  T is obviously a subsequence of T’.  Each string in R that is based on an edge e is a subsequence by figuring whether it’s in W or U.  It may help to break down what T’ looks like:

  1. 2 copies of W
  2. 1 copy of V
  3. 4c stars
  4. 1 copy of E
  5. 1 copy of V’
  6. 1 copy of E
  7. 4c stars
  8. 1 copy of V
  9. 2 copies of U

I personally think this is a mistake in the paper.  I think U needs to go at the beginning and W needs to go at the end, see below.

If e is in W, what we do is:

  • Map section a of the string to section 4 of T’
  • Map section b of the string to section 5 of T’  (We know its in those vertices, because e is in W, so its first vertex is in the cover)
  • Map section c of the string to section 7 of T’
  • Map section d of the string to section 8 of T’
  • Map section e of the string to section 9 of T’

That last part is why I think there is a mistake either in the paper, or in the way I understand it.  You can only map the edge to the last part of T’ if the edge is in that part.  So that part needs to be W, not U.  I can’t find a way to make a mapping if W is first.

If e is in U, what we do is:

  • Map section a of the string to section 1 of T’ (again, assuming the paper has the U’s and W’s backwards)
  • Map section b of the string to section 2 of T’
  • Map section c of the string to section 3 of T’
  • Map section d of the string to section 5 of T’ (the edge being in U means its second vertex is in the cover)
  • Map section e of the string to section 6 of T’

So, T’ is a supersequence of all strings in R.

In the other direction, suppose the strings in R have a supersequence of length K’.  Call that supersequence T” (since it’s not necessarily T’).  To be a supersequence of T,  T” needs to contain two sets of 4c stars someplace, and they will be consecutive (otherwise T” is too long).   Each other string in R will have it’s stars map to either the left set of 4c stars in T or the right set (they won’t split), otherwise T” is too long.

Suppose for some string in R, the stars go to the part of T” that maps the left set of stars in T.  Since there are no edge characters in T before the left set of stars, there needs to be characters added to T” to handle part a of the edge string.  (If the edge string’s stars map to the right end, we have a similar problem at the end of the string). We will also need a place to put the second vertex of the edge (part d of the string) before the copy of the edges in E.

The only way to add these extra symbols to T (forming T”) and not exceed K’ characters in length is to build something like T’ which starts with T, adds a vertex cover in the middle, and adds edges whose first vertex is not in the cover on one side of T, and edges whose second vertex is not in the cover on the other side.

Difficulty: 7.  It’s very hard to see why you need 2 copies of each edge, or why 4c is the number of stars you need.  But the idea of a construction where vertices are on the outside and edges are on the inside (in T) that needs to match up with strings with edges on the outside and vertices on the inside (in the rest of R) is something that I think students can follow, even if I doubt they could come up with it themselves.

Longest Common Subsequence

I’m going out of order here, since the next few problems are all related, and this is the easiest one, and is a good way to get thinking about this kind of problem.

The problem: Longest Common Subsequence.  This is problem SR10 in the appendix.

The description: Given a finite set of strings R over some finite alphabet Σ, and a positive integer K, can I find a string w over the alphabet Σ such that w has length K or more, and w is a subsequence of each string in R?

Example: Suppose R was the strings {aab, aaab, aaaba, aaaa}

Then the longest w we could choose that is a subsequence of everything in R wold be “aa”.  Notice that if we add “ba” to R, then the longest w we can make is length 1 (the string “a” or the string “b”)

Reduction: The paper by Maier uses VC.  (This is the “LCS” problem in the paper).  So we start with a graph G = (V,E)  and a K.  Our alphabet Σ has one character for each vertex in V.  R will consist of |E|+1 strings, based on a “template string” T consisting of the characters for all of the vertices in V, in some order. The first string in R is T itself.  Then, each edge (u,v) gets a string: Two copies of T concatenated.  The first copy has u removed, the second one has v removed.  Set K’ to |V|-K.

Suppose G has a VC, call it V’.  Then take T and remove all of the vertices in V’ from it.  This remaining string will be called T’ and is of size |V|-K.  We will show that T’ is a subsequence of each string in R.  It’s pretty clearly a substring of T.

Each other string in R is built from some edge (u,v) in E.  So either u or v is in V’.  So u will be missing from T’, and u will be missing from the first copy of T in the string in R, so T’ is a substring of that first copy.  Similarly, if v is in V’, then T’ is a substring of the second copy of T.

If we find a string T’ of size K’, then first notice that for each string in R made from an edge (u,v), T’ can’t have both u and v.  The reason is that if (say) u comes before v in T, then in the string in R, u will be missing in the first half of that string, and v will be missing in the second half.  So the occurrence of v will come first in the string and the occurrence of u will come second, so they will be “swapped”.

It’s much easier to see with an example.  Suppose we have 5 vertices, and T = 12345.  If we have the edge (2,4), then the string in R is 13451235.  T’ can’t contain 24 in that order, because the 4 comes before the 2 in the string.

So this means that if we let V’ be the vertices in V that are not in T’, we have K vertices and at least one of the characters corresponding to that vertex is in each string in R, and so at least one vertex from V’ is in each edge in E.  This forms a vertex cover of V.

Difficulty: 5.  I may have downgraded this difficulty after seeing the much harder SR8 and SR9, but I do think that it is pretty straightforward.

Capacity Assignment

This problem is from the same “unpublished manuscript” as last week’s.

The problem: Capacity Assignment.  This is problem SR7 in the appendix.

The description: Given a set C of “communication links”, and set M of positive capacities.  Each pair of a link c and a capacity m also has a cost function g(c,m) and delay penalty d(c,m) that has the following properties:

  • If i < j ∈ M, then g(c,i) ≤ g(c,j)
  • If i < j ∈ M, then d(c,i) ≥ d(c,j)

We’re also given positive integers K and J.  The problem is: Can we assign a capacity to each link such that the total g cost of all of our assignments is ≤ K and the total d cost of all of our assignments is ≤ J?

Example: There’s a lot to parse in that problem description.  The first thing to notice is that the set of links C doesn’t necessarily have to link anything together (it’s not like it has to apply to an underlying graph).  So we can just give them names:

C={a,b,c,d,e}

Next, there is no reason why the set of capacities has to be assigned as a bijection to C- the set M could be a different size entirely than the size of C:

M={1,2}

The cost function has to have the property that if we assign a 2 to a link, it has to cost as least as much as assigning 1 to the link:

g(c,1) = 3 for all c

g(c,2) = 4 for all c

The delay function has to have the property that if we assign a higher capacity to a link, the delay can’t be larger than assigning a lower capacity:

d(c,1) = 6 for all c

d(c,2) = 5 for all c

In this case, if we assign the capacity of 1 to all links, we get a total cost of 15 and a total delay of 30.  If we assign the capacity of 2 to all links, we get a total cost of 20 and a total delay of 25.     If we have K = 18, and J = 27, we can achieve that by setting 2 links to have capacity 1 and 3 links to have capacity 2.

The reduction: The example above is pretty close to how the reduction will work.  We will reduce from Sum of Subsets, so we start with a set S of integers and a target B.   Our set C will have one element for each element in S.  Our set M will be {1,2}.  Assigning a capacity of 1 will imply we don’t want to take this element in S’, and assigning a capacity of 2 will imply that we do.  (This makes more sense if I can use the set {0,1} for M, but the problem description says the elements of M have to be positive)

We will define our g function so that g(c,1) = 1 for all c, and g(c,2) will be s(c)+1 (where s(c) is the size of the element in S that corresponds to c).

Our d function will work similarly:  d(c,1) = s(c)+1 for all c, and d(c,2) = 1 for all c.  These functions both follow the restrictions for how g and d work.

Set K = |S| + B.  Since each cost is either s(c)+1 or 1, this is saying that there needs to be enough elements assigned a 1 (such that its cost is 1, instead of s(c)+1) to that the sizes of those elements does not exceed K.

Let T = The sum of all of the sizes of all of the elements in S.  Then let J = |S| + T – B.  Again, each d value always includes 1, and may include s(c) as well.  So this is saying that there needs to be enough values assigned a 2 (so that its delay is 1) so that the sizes of those elements does not exceed J.

If S has a SOS solution S’, then assigning a capacity of 2 to all elements in S’ and a 1 to all elements in S’ gives us a cost value of exactly K, and a delay value of exactly J.

If we have a Capacity Assignment solution, then notice that K+J = 2|S|  + T, and so is the sum of all delays and capacities no matter what assignment is chosen.  (g(c,m) + d(c,m) = s(c)+2, for all c, no matter what m we use).  So if the sum of the delays (or costs) were strictly less than K, the sum of the costs (or delays) would have to be strictly more than J.  The only way to satisfy both the K and J constraints is to make the sums exactly equal, which gives us a SOS solution.

Difficulty: 4.  I think the algebra for this problem is a little easier than last week’s, but it does take some work to understand what the problem is asking.  Changing the problem slightly to allow assignments and costs and delays to be 0 instead of making them all be positive integers makes the reduction easier too.

Multiple Copy File Allocation

These next two problems are from an “unpublished manuscript” that I can’t find.  Assuming I did them correctly, I think they are also good problems for students to work on.

The problem: Multiple Copy File Allocation.  This is problem SR6 in the appendix.

The description: Given a graph G=(V,E), where each vertex v ∈ V has a “usage” u(v), and a “storage cost” s(v) (both positive integers), and a positive integer K.  Can we find a subset V’ of V where for each v ∈ V we define d(v) to be the length of the shortest path from v to some member of V’, that the sum of:

  • The s(v) values for all vertices in V’ plus
  • The d(v)*u(v) values for all vertices not in V’ is ≤ K?

Example: The idea is that each vertex in V’ pays its s(v) cost, and each vertex not in V’ pays its u(v) cost for each edge is has to go through to get to something in V’.  Here’s a simple example:

sr6

Each node lists its s value first and u value second.  If we take just the “1,10” node into V’, our total cost is:

  • 1 from the “1,10” node (its s value)
  • 2 + 3 + 4 from each of its neighbors (their u values * a distance of 1)
  • 2 from the “90,1” vertex.  (u cost of 1, times a distance of 2 edges from the vertex in V’)

..for a total of 12.  I think that’s minimal.

Reduction: G&J say to use  Vertex Cover, and mention that we can make all s values equal to each other and all u values equal to each other.  The phrasing implies that we shouldn’t make the s value and u values the same altogether, which is a good hint.

So we start with out VC instance: a graph G=(V,E) and an integer K.  We will use the same graph.  Let s = the s(v) for each vertex = |V|.  Let u = the u(v) for each vertex =  \lceil|V|/2 \rceil + 1 (The smallest integer larger than |V|/2).

Set K’ = K*s + (|V|-K)* u.  Hopefully it’s clear how this number will allow a solution to the VC problem take that cover and create a solution to the MCFA problem.

If we have a MFCA solution, first notice that any vertices that are 2 (or more) edges away from something in V’ are contributing 2*u to the total cost.  Since 2*u > s, we can add a vertex to V’ at a distance 1 away from these distant vertices and keep our total cost ≤ K’.  This means we can transform any MCFA solution into one that is also a solution but where V’ is a cover of V.

Also notice that even after making these changes that if we were to have K+1 (or more) vertices in V’, our total cost would be (K+1)*s + (|V|-K-1)* u.  This is more than K’ (it adds an extra s of cost |V| and removes a u of cost |V|/2+1).

So, we can construct a MCFA solution where there are K or less vertices in V’ that form a cover.  This gives us a solution to the VC instance.

Difficulty: 5.  It takes a little playing around with the algebra to get the correct values, but the idea that we can make the s and u values the same for all vertices is helpful, and gives a clear path to how you’d reduce this problem from VC.