External Macro Data Compression

These next problems are related “macro compression” problems.

The problem: External Macro Data Compression.  This is problem SR22 in the appendix.

The description: Given a string s over some alphabet Σ, a “pointer cost” h, and a bound B, can we find two strings D and C over Σ augmented with some number (< |s|) of “pointer characters” p_i such that:

  • |D| + |C| + (h-1) * (number of pointer characters in D and C) ≤ B
  • We can generate s by replacing pointer characters in C with their “meaning” in D.

Example: The concept they’re trying to get at isn’t that hard, it’s just hard to explain using mathematical language.  Suppose s = “ABCDEFABCGABCDEF”

Then we can define D to be “ABCDEF” and C to be “pqpGpq”.  The total size of this is 6 for D, plus 6 for C.  There are 5 pointers in C, and each one adds h-1 to the cost, so our total cost is 6+6+5(h-1).

The idea is that the characters p and q are “pointers” that refer to substrings in D (p refers to “ABC” and q refers to “DEF”).  By replacing those pointers with what they “mean” in C, we can get back s.

A tricky part of this is that you are allowed to have substrings overlap.  So if s were “ABCDBCD” we could define D to be “ABCD” and C to be “pq” with p meaning “ABCD” and q meaning “BCD”.  Now our total cost is 4 for D, 2 for C, and 2 pointers, so 4+2+2(h-1).
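To check the two examples above, here is a minimal sketch that expands C back into s and evaluates the cost formula from the description. The function names `expand` and `cost`, and the choice to represent each pointer as a hypothetical (start, length) pair into D, are my own assumptions for illustration. The sketch also assumes D itself contains no pointer characters.

```python
def expand(C, D, pointers):
    """Replace each pointer character in C with the substring of D it refers to.

    pointers maps a pointer character to (start, length) in D.
    Assumption: D itself contains no pointer characters.
    """
    out = []
    for ch in C:
        if ch in pointers:
            start, length = pointers[ch]
            out.append(D[start:start + length])
        else:
            out.append(ch)
    return "".join(out)


def cost(C, D, pointers, h):
    """|D| + |C| + (h - 1) * (number of pointer characters in D and C)."""
    n_ptr = sum(ch in pointers for ch in C + D)
    return len(D) + len(C) + (h - 1) * n_ptr


# First example: p -> "ABC", q -> "DEF"
ex1 = ("pqpGpq", "ABCDEF", {"p": (0, 3), "q": (3, 3)})
# Second example: overlapping substrings, p -> "ABCD", q -> "BCD"
ex2 = ("pq", "ABCD", {"p": (0, 4), "q": (1, 3)})

print(expand(*ex1), cost(*ex1, h=2))
print(expand(*ex2), cost(*ex2, h=2))
```

Note that with h = 2 the first example costs 17, which is more than |s| = 16, while with h = 1 it costs 12.  So whether a given set of pointers actually pays off depends on h.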

Reduction:  The reduction (and the one for SR23) comes from a technical report by Storer, which is pretty dense.  I think we’re looking at “Theorem 2” in the paper, which reduces from VC≤3 (Vertex Cover where each vertex has degree at most 3).

The alphabet that will be built from the VC instance has a lot of parts:

  • A special symbol $
  • 3 symbols v_i, a_i, and b_i for each vertex v_i in the graph
  • 4 more symbols f_{i,1,1} through f_{i,2,2} for each vertex v_i in the graph
  • One symbol d_i for each edge e_i in the graph
  • 2 symbols c_1 and c_2 (there is actually one c symbol for each value from 1 to h.  We’ll assume h=2 here)
  • 3 symbols g_{1,1}, g_{1,2}, and g_{2,1} (This is also based on h.  So really you’d go from g_{1,1} through g_{h-1,2} and add g_{h,1})

The string s will also be built from parts (a lot of these are based on h as well; again, I’m fixing h=2 to keep it simpler):

  • V_{i,1} = a_i$v_i
  • V_{i,2} = v_i$b_i
  • For each edge e_i = (v_j, v_k), E_i = $v_j$v_k$
  • We also have Z_1 = (c_1)^3 (so 3 copies of c_1) and Z_2 = (c_2)^3

s is:

  • E_i d_i concatenated for each edge, followed by
  • V_{i,j} f_{i,j,k} concatenated over each vertex and each possible f symbol, followed by
  • Z_1 g_{1,1} Z_1 g_{1,2}, followed by
  • Z_2 g_{2,1} Z_2
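Since the pieces are easier to follow on a concrete graph, here is a rough sketch that assembles s for h=2 as a list of symbols.  This is my reading of the construction, not Storer's own notation: the function name `build_s` and the symbol spellings (like "f1,1,2") are made up for illustration, I'm assuming each E_i contains both endpoints of its edge, and I'm assuming the V/f part is ordered by vertex i, then j, then k.

```python
def build_s(vertices, edges):
    """Build s (for h = 2) as a list of symbols.

    vertices: list of vertex labels, e.g. [1, 2]
    edges: list of (j, k) pairs of vertex labels
    Each list element is one symbol of the alphabet, since the
    symbols have multi-character names here.
    """
    s = []

    # E_i d_i for each edge; assumes E_i = $ v_j $ v_k $
    for i, (j, k) in enumerate(edges, start=1):
        s += ["$", f"v{j}", "$", f"v{k}", "$", f"d{i}"]

    # V_{i,j} f_{i,j,k} for each vertex i, each j in {1,2}, each k in {1,2}
    for i in vertices:
        V = {
            1: [f"a{i}", "$", f"v{i}"],  # V_{i,1} = a_i $ v_i
            2: [f"v{i}", "$", f"b{i}"],  # V_{i,2} = v_i $ b_i
        }
        for j in (1, 2):
            for k in (1, 2):
                s += V[j] + [f"f{i},{j},{k}"]

    # Z_1 g_{1,1} Z_1 g_{1,2} Z_2 g_{2,1} Z_2
    Z1 = ["c1"] * 3
    Z2 = ["c2"] * 3
    s += Z1 + ["g1,1"] + Z1 + ["g1,2"] + Z2 + ["g2,1"] + Z2
    return s


# A single edge between vertices 1 and 2
s = build_s([1, 2], [(1, 2)])
print(" ".join(s))
print(len(s))
```

Under these assumptions each edge contributes 6 symbols, each vertex contributes 16, and the Z/g tail contributes 15, so the single-edge graph gives |s| = 53.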

The bound for the compression instance is K’ = |s| + K – (7/2)|V|, where K is the bound from the VC instance.

The basic idea from here is that if G has a VC, we can compress s by making pointers for (among other things) the combination V_{i,1} f_{i,1,2} V_{i,2} and the combination V_{i,2} f_{i,2,2} V_{i+1,1}, where vertex v_i is in the VC.  This lets us use the pointers both for the part of the string with the V’s and f’s, and also for the $v_i$ strings in the E components, saving space.

In the other direction, if we have a compressed string of size K’, he shows that you must have compressed the string in the way described above, and the overlaps show you the vertices in the cover.

Difficulty: 8.  I think to really get what’s happening, you need to actually build an example on a graph.  But I think the idea of building the strings so that the overlap is exactly what you’re looking for is something that can be gotten eventually.