Tag Archives: Shortest Common Supersequence

Bounded Post Correspondence Problem

I’m sure most people remember the “Post Correspondence Problem” from Theory of Computation classes.  It’s mainly famous for two reasons:

  • It’s the classic example of an undecidable problem that isn’t about a language (like the Halting Problem is)
  • It leads to lots of fun (meaning “bad”) PCP jokes.  (“I need some PCP to be able to handle thinking about the PCP”)

Anyway, here is a restricted version of the problem that at least is decidable, if intractable.

The problem: Bounded Post Correspondence Problem.  This is problem SR11 in the appendix.

The description: Given two sequences of strings A =(a_1, ...a_n) and B = (b_1, ..., b_n) from a finite alphabet, and a positive integer K, can I find a sequence of integers i1..ik, where k ≤ K, and the string A’= a_{i_{1}} a_{i_{2}} ..a_{i_{k}} is identical to the string B’ =b_{i_1}b_{i_2}..b_{i_k}?

Example: The difference between this and “classic” PCP is the introduction of K, which makes the problem decidable because if nothing else you can just try all combinations of strings from A and B that are length K or less, which is finite.

So, any PCP instance that “works” is a good example to use here.  Here’s one that I got from my old Hopcroft and Ullman theory of computation book:

A =  {1, 10111,10}  B = {111,10,0}

We can construct the string 101111110.  From A, it’s built from 10111/1/1/10.  From B, it’s built from 10/111/111/0

Notice that we need to use the same elements from each string in the sequence (The second one, then the first one twice, then the third one).

Reduction: The reference in G&J is to a technical report by Constable, Hunt, and Sahni and says they use a “generic transformation”.   In the paper, that means they show it’s NP-Complete by showing it’s a “negation of freedom for a 2 variable loop free scheme”  Which is, as near as I can tell, a way of defining programs to be equivalent.

I’ll admit that I don’t really understand it.  I really want there to be a simpler reduction out there.  I found this post on stackexchange,  but it reduces from Shortest Common Supersequence, and uses a |R| of 2, and G&J say that Shortest Common Supersequence is polynomial with |R|=2, so something has to be wrong there.

I was tinkering with a reduction from Directed Hamiltonian Cycle, where the strings in A corresponded to edges, and the strings in B correspond to vertices.  So, for a graph like this:

You would get strings like this:

A B
x xa
ab bb
bc cc
ca a
cd dd

The idea is that each string in A corresponds to a directed edge, and each string in B corresponds to 2 copies of the destination of that edge.  The “x” is there to give us a starting point- we arbitrarily choose some starting vertex to be the start.  All edges that have this starting vertex as its destination only have 1 copy (instead of 2) of that vertex in B.

The problem is that a string like xabbcca  (corresponding to a non-Hamiltonian cycle) is a solution to the PCP problem, but is not a solution to the HC problem.  I don’t know how to force the string to be long enough to visit all vertices- I thought about working on a variant of the problem where you force the size of the string you create to be exactly =K instead of ≤ K.  But  since a cycle that doesn’t involve the start vertex can be “pumped” several times to make a string longer, that won’t work either.  There is probably something clever you can do with the strings to force you to visit all vertices, but I can’t see it.

Difficulty: 10, sadly.  I’m sure there is something easier out there.

Shortest Common Superstring

I’ve been doing this for long enough now that I start seeing the same paper come up multiple times in multiple situations.  When I looked up the reference for this problem, I remembered that I first got this paper as a reference for the Vertex Cover on Cubic Graphs problem, and now we’re doing the problem that they actually needed that problem for.  It was kind of cool to go into my file cabinet and say “Oh yeah, I remember this paper”

(In my VC3 post, I noted that this paper was currently my “Most Obscure Reference”.  That title has been supplanted by the Rosenthal PhD thesis I needed for the Network Survivability problem, which they sent me a hard copy of and wouldn’t let me duplicate  it or remove it from the library.   That might never be passed, though there is a problem coming up (Sparse Matrix Compression) that I still haven’t found a reference for.)

The problem: Shortest Common Substring.  This is problem SR9 in the appendix.

The description: Given a finite set R of strings from some finite alphabet, and a positive integer K, can we find a string w of length K or less such that each string in R is a substring of W?

Example: Notice how similar this is to last week’s Shortest Common Supersequence problem.  The difference is that subersequences can have gaps in the matching, but superstrings can’t.  So if we use the strings from last week: {aab, aaab, aaaba, aaaa}, our w needs to have each of these as a substring (in consecuitive positions in w).  The string “aaaaba” is the shortest I could find.  If we add “bb” to R, I think the only way to do that is to add the whole string bb either at the start or the end, since you need both bb and aaba consecutively in w.

Reduction: The paper by Maier and Storer has the reduction.  This is why they needed to prove Vertex Cover on cubic graphs was NP-Complete in the first place.  They show that this problem is true even in the case where all strings in R are the same length H, as long as H is at least 8.  For our purposes, we can get away with setting H=8, and moving on.

So, suppose we’re given a graph G=(V,E), where all vertices in G have degree 3, and an integer K.  For each vertex, we’ll label it’s 3 edges with a number between 0 and 2.  For vertex v and edge e, call this anumber g(a,e).  They don’t enforce a rule that if an edge e=(a,b) that g(a,e) = g(b,e) .

Each vertex a will generate 5 characters in the alphabet: a, a’, a0, a1, and a2. There is also a character $ not used anywhere else.  Each vertex generates 7 strings:  (Recall that H=8 for us, but I’m leaving it as H to make it consistent with the paper.

  1. $H-2aa0
  2. a0$H-2a1
  3. a1a’$H-4aa1
  4. a1$H-2a2
  5. a2a’$H-4aa2
  6. a2$H-2a0
  7. a0a’$H-2

Note that each of those strings is length H

We also define two components for each edge e=(a,b):

  • Ea = aag(a,e)a(g(a,e)+1)%3a’
  • Eb = bbg(b,e)b(g(b,e)+1)%3b’

..both of size 4.  From that we build a string of length H we add to R:

Eab=Ea$H-8Eb and

Eba = Eb$H-8Ea

..both of length H.  R has 7*|V|+2*|E| strings in it.  Set K’ = |R| + K – 11|E|.

Notice that the 7 vertex strings all overlap (the end character of one is the starting character of another).  The two strings for each edge also overlap.  So a “standard” superstring that contains all strings as a substring with these overlapping characters is of length |R| – 6|V| – 4 |E|.  Since |V| = 2/3 |E| (because G is cubic), that is length |R| – 8 |E|.

If we are given a VC of G, then suppose some vertex v is in the cover.  The edge (v,w) is connected with v.  Then either string 1,3, or 5 from the vertex list overlaps with the beginning of the string corresponding with (v,w), and either string 3,5, or 7 can overlap with the end of the edge string corresponding with (v,w).  The paper shows how making a series of these improvements, if we have a vertex cover, we can take the standard substring and bring the length down to K’.

In the other direction, if we have a minimal length superstring, we need to use it to build a cover of G.  The paper does this by showing that to be of this length, there need to be overlaps of the kind described above, and from those overlaps, we get our cover.

Difficulty: 8.  I wish there was an easy reduction to this from Shortest Common Supersequence, but I can’t find it.  I also wish that I could think of a better notation than g(a,e) because having them as subscripts and having that “mean” a character like a0-a2 is hard to follow.

Shortest Common Supersequence

Now to go back to the problems I skipped over.  This one uses a more complicated version of the “template string” used in the Longest Common Subsequence reduction.

I also think this reduction has a small mistake in the paper, though it’s possible I’m misunderstanding something.  Anyone who’s interested is encouraged to check the paper, and my work to see what’s going on.

The problem: Shortest Common Supersequence.  This is problem SR8 in the appendix.

The description: Given a finite set of strings R over some finite alphabet Σ, and a positive integer K. can I find a string w over the alphabet Σ of size K or less such that each string in R is a subsequence of w?

Example: Here’s the R from last time: {aab, aaab, aaaba, aaaa}

If we choose w to be aaaba, each string in R is a subsequence of w.  (Remember subsequence is different from substring in that subsequences are allowed to have gaps.  So for example aaaa is a subsequence of aaaba.).  If we add the string “bb” to R, then  w needs to have an extra b someplace (some candidates for w are aaabba, aaabab, baaaba, and so on).

Reduction: The same paper by Maier as last week has this reduction too (This is the “SCS” problem in the paper).  The reduction is again from VC, so we’re given a graph G=(V,E) and an integer K.   The alphabet ∑ will have one symbol for each vertex in V, one symbol for each edge in E, and a special * symbol.  Define c = max(|V|,|E|).  Also note that the vertices and edges are “ordered” somehow (it doesn’t matter the exact ordering) so that we can have an order of the vertex and edge characters in ∑.

Just as in the LCS problem, the strings in R will be based on a template string.  The template string T consists of (in order):

  • All of the alphabet symbols corresponding to the vertices in V, in order
  • A section of 4c stars
  • All of the alphabet symbols corresponding to the edges in E, in order, twice in a row (so 2 copies of the first edge symbol, then 2 copies of the second edge symbol, and so on)
  • All of the alphabet symbols corresponding to the edges in E, in order, twice in a row (again)
  • Another section of 4c stars
  • All of the alphabet symbols corresponding to the vertices in V, in order (again)

This template string is length 8c+4|E|+2|V|

T will be in R, and we will have one string in R for each edge in E.  For an edge e=(vi, vj), we create a string with (in order):

  1. Two copies of the alphabet symbol for e
  2. The alphabet symbol for vi
  3. A section of 4c stars
  4. The alphabet symbol for vj
  5. Two copies of the alphabet symbol for e (again)

Set K’ = 8c+6|E| + 2|V| + K

If G has a VC V’ of size K, then each edge has at least one vertex in V’.  Let W be the set of edges whose “first” vertex occurs in the cover (by “first”, we mean the vertex that comes alphabetically first in the ordering we used to make ∑), and let U be the set of all remaining edges (and thus for all edges in U, the “second” vertex of the edge is in V’).

Take our template string T and add:

  • The alphabet symbol for each edge in W, twice, at the beginning of T
  • The alphabet symbols for each vertex in V’ in between the two copies of the edge lists in the middle (between the third and fourth bullets in the above definition of T)
  • The alphabet symbol for each edge in U, twice, at the end of T.

This adds 2|E|+k to T’s size  (each edge is either in W or U, and will appear twice), and so this new T’ is the right size for a SCS solution.

Now we need to show that each string in R is a subsequence of this new T’.  T is obviously a subsequence of T’.  Each string in R that is based on an edge e is a subsequence by figuring whether it’s in W or U.  It may help to break down what T’ looks like:

  1. 2 copies of W
  2. 1 copy of V
  3. 4c stars
  4. 1 copy of E
  5. 1 copy of V’
  6. 1 copy of E
  7. 4c stars
  8. 1 copy of V
  9. 2 copies of U

I personally think this is a mistake in the paper.  I think U needs to go at the beginning and W needs to go at the end, see below.

If e is in W, what we do is:

  • Map section a of the string to section 4 of T’
  • Map section b of the string to section 5 of T’  (We know its in those vertices, because e is in W, so its first vertex is in the cover)
  • Map section c of the string to section 7 of T’
  • Map section d of the string to section 8 of T’
  • Map section e of the string to section 9 of T’

That last part is why I think there is a mistake either in the paper, or in the way I understand it.  You can only map the edge to the last part of T’ if the edge is in that part.  So that part needs to be W, not U.  I can’t find a way to make a mapping if W is first.

If e is in U, what we do is:

  • Map section a of the string to section 1 of T’ (again, assuming the paper has the U’s and W’s backwards)
  • Map section b of the string to section 2 of T’
  • Map section c of the string to section 3 of T’
  • Map section d of the string to section 5 of T’ (the edge being in U means its second vertex is in the cover)
  • Map section e of the string to section 6 of T’

So, T’ is a supersequence of all strings in R.

In the other direction, suppose the strings in R have a supersequence of length K’.  Call that supersequence T” (since it’s not necessarily T’).  To be a supersequence of T,  T” needs to contain two sets of 4c stars someplace, and they will be consecutive (otherwise T” is too long).   Each other string in R will have it’s stars map to either the left set of 4c stars in T or the right set (they won’t split), otherwise T” is too long.

Suppose for some string in R, the stars go to the part of T” that maps the left set of stars in T.  Since there are no edge characters in T before the left set of stars, there needs to be characters added to T” to handle part a of the edge string.  (If the edge string’s stars map to the right end, we have a similar problem at the end of the string). We will also need a place to put the second vertex of the edge (part d of the string) before the copy of the edges in E.

The only way to add these extra symbols to T (forming T”) and not exceed K’ characters in length is to build something like T’ which starts with T, adds a vertex cover in the middle, and adds edges whose first vertex is not in the cover on one side of T, and edges whose second vertex is not in the cover on the other side.

Difficulty: 7.  It’s very hard to see why you need 2 copies of each edge, or why 4c is the number of stars you need.  But the idea of a construction where vertices are on the outside and edges are on the inside (in T) that needs to match up with strings with edges on the outside and vertices on the inside (in the rest of R) is something that I think students can follow, even if I doubt they could come up with it themselves.