Shortest Common Superstring

I’ve been doing this for long enough now that I start seeing the same paper come up multiple times in multiple situations.  When I looked up the reference for this problem, I remembered that I first got this paper as a reference for the Vertex Cover on Cubic Graphs problem, and now we’re doing the problem that they actually needed that problem for.  It was kind of cool to go into my file cabinet and say “Oh yeah, I remember this paper”

(In my VC3 post, I noted that this paper was currently my “Most Obscure Reference”.  That title has been supplanted by the Rosenthal PhD thesis I needed for the Network Survivability problem, which they sent me a hard copy of and wouldn’t let me duplicate  it or remove it from the library.   That might never be passed, though there is a problem coming up (Sparse Matrix Compression) that I still haven’t found a reference for.)

The problem: Shortest Common Substring.  This is problem SR9 in the appendix.

The description: Given a finite set R of strings from some finite alphabet, and a positive integer K, can we find a string w of length K or less such that each string in R is a substring of W?

Example: Notice how similar this is to last week’s Shortest Common Supersequence problem.  The difference is that subersequences can have gaps in the matching, but superstrings can’t.  So if we use the strings from last week: {aab, aaab, aaaba, aaaa}, our w needs to have each of these as a substring (in consecuitive positions in w).  The string “aaaaba” is the shortest I could find.  If we add “bb” to R, I think the only way to do that is to add the whole string bb either at the start or the end, since you need both bb and aaba consecutively in w.

Reduction: The paper by Maier and Storer has the reduction.  This is why they needed to prove Vertex Cover on cubic graphs was NP-Complete in the first place.  They show that this problem is true even in the case where all strings in R are the same length H, as long as H is at least 8.  For our purposes, we can get away with setting H=8, and moving on.

So, suppose we’re given a graph G=(V,E), where all vertices in G have degree 3, and an integer K.  For each vertex, we’ll label it’s 3 edges with a number between 0 and 2.  For vertex v and edge e, call this anumber g(a,e).  They don’t enforce a rule that if an edge e=(a,b) that g(a,e) = g(b,e) .

Each vertex a will generate 5 characters in the alphabet: a, a’, a0, a1, and a2. There is also a character $ not used anywhere else.  Each vertex generates 7 strings:  (Recall that H=8 for us, but I’m leaving it as H to make it consistent with the paper.

  1. $H-2aa0
  2. a0$H-2a1
  3. a1a’$H-4aa1
  4. a1$H-2a2
  5. a2a’$H-4aa2
  6. a2$H-2a0
  7. a0a’$H-2

Note that each of those strings is length H

We also define two components for each edge e=(a,b):

  • Ea = aag(a,e)a(g(a,e)+1)%3a’
  • Eb = bbg(b,e)b(g(b,e)+1)%3b’

..both of size 4.  From that we build a string of length H we add to R:

Eab=Ea$H-8Eb and

Eba = Eb$H-8Ea

..both of length H.  R has 7*|V|+2*|E| strings in it.  Set K’ = |R| + K – 11|E|.

Notice that the 7 vertex strings all overlap (the end character of one is the starting character of another).  The two strings for each edge also overlap.  So a “standard” superstring that contains all strings as a substring with these overlapping characters is of length |R| – 6|V| – 4 |E|.  Since |V| = 2/3 |E| (because G is cubic), that is length |R| – 8 |E|.

If we are given a VC of G, then suppose some vertex v is in the cover.  The edge (v,w) is connected with v.  Then either string 1,3, or 5 from the vertex list overlaps with the beginning of the string corresponding with (v,w), and either string 3,5, or 7 can overlap with the end of the edge string corresponding with (v,w).  The paper shows how making a series of these improvements, if we have a vertex cover, we can take the standard substring and bring the length down to K’.

In the other direction, if we have a minimal length superstring, we need to use it to build a cover of G.  The paper does this by showing that to be of this length, there need to be overlaps of the kind described above, and from those overlaps, we get our cover.

Difficulty: 8.  I wish there was an easy reduction to this from Shortest Common Supersequence, but I can’t find it.  I also wish that I could think of a better notation than g(a,e) because having them as subscripts and having that “mean” a character like a0-a2 is hard to follow.

Leave a Reply

Your email address will not be published. Required fields are marked *