These next problems are related “macro compression” problems.
The problem: External Macro Data Compression. This is problem SR22 in the appendix.
The description: Given a string s over some alphabet Σ, a “pointer cost” h and a bound B, can we find two strings D and C over the alphabet of Σ augmented with some number (< |s|) of “pointer characters” pi such that:
- |D| + |C| + (h-1)* (# of p characters in D and C) ≤ B
- We can generate s by replacing pointer characters in C with their “meaning” in D.
Example: The concept they’re trying to get at isn’t that hard, it’s just hard to explain using mathematical language. Suppose s = “ABCDEFABCGABCDEF”
Then we can define D to be “ABCDEF” and C to be “pqpGpq”. The total size of this is 6 for D, plus 6 for C. There are 5 pointers, so our total cost is 6+6+5h.
The idea is that the characters p and q are “pointers” that refer to substrings in D (p refers to “ABC” and q refers to “DEF”). By replacing those pointers with what they “mean” in C, we can get back s.
A tricky part of this is that you are allowed to have substrings overlap. So if s was “ABCDBCD” we could define D to be “ABCD” and C to be “pq” with p meaning “ABCD” and q meaning “BCD”. Now our total cost is 4 for D, 2 for C, and 2 pointers, so 4+2+h.
The alphabet that will be built from the VC instance has a lot of parts:
- A special symbol $
- 3 symbols vi, ai, and bi for each vertex vi in the graph
- 4 more symbols fi,1,1 through fi,2,2 for each vertex vi in the graph
- one symbol di for each edge ej in the graph.
- 2 symbols c1 and c2 (there is actually one c symbol for each value of h. We’ll assume h=2 here)
- 3 symbols g1,1 g1,2 and g2,1 (This is also based on h. So really you’d go from g1,1 through gh-1,2 and add gh,1)
The string s will also be built from parts (a lot of these are based on h has well, again, I’m fixing h=2 to keep it simpler)
- Vi,l = ai$vi
- Vi,2 = vi$bi
- For each edge ei= (vj, vk), Ei = $vj$vj$
- We also have Z1 = (c1)3 (so 3 copies of c1) and Z2 = (c2)3
- Eidi concatenated for each edge, followed by
- Vi,jfi,j,k concatenated over each vertex and each possible f symbol, followed by
- Z1g1,1Z1g1,2, followed by
K’ = |s| + K – (7/2)|V|.
The basic idea from here is that if G has a VC, we can compress s by making pointers for (among other things) the combination of Vi,1fi,1,2Vi,2 and the combination of Vi,2fi,2,2Vi+1,1 where vertex vi is in the VC. This lets us use the pointers both for the part of the string with the V’s and f’s, but also for the $vi$ strings in the E components, saving space.
In the other direction, if we have a compressed string of size K’, he shows that means that you have compress a string like the above and the overlaps show you the vertices in the cover.
Difficulty: 8. I think to really get what’s happening, you need to actually build an example on a graph. But I think the idea of building the sets so that that is the overlap you’re looking for is something that can be gotten eventually.