These next problems are related “macro compression” problems.

**The problem: **External Macro Data Compression. This is problem SR22 in the appendix.

**The description: **Given a string s over some alphabet Σ, a “pointer cost” h and a bound B, can we find two strings D and C over the alphabet of Σ augmented with some number (< |s|) of “pointer characters” p_{i} such that:

- |D| + |C| + (h-1)* (# of p characters in D and C) ≤ B
- We can generate s by replacing pointer characters in C with their “meaning” in D.

**Example: ** The concept they’re trying to get at isn’t that hard, it’s just hard to explain using mathematical language. Suppose s = “ABCDEFABCGABCDEF”

Then we can define D to be “ABCDEF” and C to be “pqpGpq”. The total size of this is 6 for D, plus 6 for C. There are 5 pointers, so our total cost is 6+6+5h.

The idea is that the characters p and q are “pointers” that refer to substrings in D (p refers to “ABC” and q refers to “DEF”). By replacing those pointers with what they “mean” in C, we can get back s.

A tricky part of this is that you are allowed to have substrings overlap. So if s was “ABCDBCD” we could define D to be “ABCD” and C to be “pq” with p meaning “ABCD” and q meaning “BCD”. Now our total cost is 4 for D, 2 for C, and 2 pointers, so 4+2+h.

**Reduction**: The reduction (and the one for SR23) comes from a technical report by Storer, which is pretty dense. I think we’re looking at “Theorem 2” in the paper, which is from VC<=3.

The alphabet that will be built from the VC instance has a lot of parts:

- A special symbol $
- 3 symbols v
_{i}, a_{i}, and b_{i}for each vertex v_{i}in the graph - 4 more symbols f
_{i,1,1}through f_{i,2,2}for each vertex v_{i}in the graph - one symbol d
_{i}for each edge e_{j}in the graph. - 2 symbols c
_{1}and c_{2}(there is actually one c symbol for each value of h. We’ll assume h=2 here) - 3 symbols g
_{1,1}g_{1,2}and g_{2,1}(This is also based on h. So really you’d go from g_{1,1}through g_{h-1,2}and add g_{h,1})

The string s will also be built from parts (a lot of these are based on h has well, again, I’m fixing h=2 to keep it simpler)

- V
_{i,l}= a_{i}$v_{i} - V
_{i,2}= v_{i}$b_{i} - For each edge e
_{i}= (v_{j}, v_{k}), E_{i}= $v_{j}$v_{j}$ - We also have Z
_{1}= (c_{1})^{3}(so 3 copies of c_{1}) and Z_{2}= (c_{2})^{3}

s is:

- E
_{i}d_{i}concatenated for each edge, followed by - V
_{i,j}f_{i,j,k}concatenated over each vertex and each possible f symbol, followed by - Z
_{1}g_{1,1}Z_{1}g_{1,2}, followed by - Z
_{2}g_{2,1}Z_{2}

K’ = |s| + K – (7/2)|V|.

The basic idea from here is that if G has a VC, we can compress s by making pointers for (among other things) the combination of V_{i,1}f_{i,1,2}V_{i,2} and the combination of V_{i,2}f_{i,2,2}V_{i+1}_{,1} where vertex v_{i} is in the VC. This lets us use the pointers both for the part of the string with the V’s and f’s, but also for the $v_{i}$ strings in the E components, saving space.

In the other direction, if we have a compressed string of size K’, he shows that means that you have compress a string like the above and the overlaps show you the vertices in the cover.

**Difficulty: **_{8}. I think to really get what’s happening, you need to actually build an example on a graph. But I think the idea of building the sets so that that is the overlap you’re looking for is something that can be gotten eventually.