Internal Macro Data Compression

I’m finally back from my trip.  Apologies for the delay.

The problem: Internal Macro Data Compression.  This is problem SR23 in the appendix.

The description: Given an alphabet Σ, a string s in Σ*, a “pointer cost” h and a bound B, can we find a string C in the alphabet of Σ augmented with at most s “pointer characters” pi such that:

  • |C|+ (h-1)* (# of pointer characters in C) is ≤ B, and
  • We can replace pointer characters with substrings of C to regain s?

Example: This is similar to the last macro compression problem, but instead of having 2 strings C and D, C serves both purposes.  In our example from last time, we had s = “ABCDEFABCGABCDEF”, and came up with C= “pqpGpq” and D= “ABCDEF”.  The difference here is that there is no D string, we have to do it all in C.  We can do that by letting C = “ABCDEFqpGpq”.  Now if we let p=”ABC” (or, “the substring of C of length 3 starting at position 1” and q = “DEF” (or, “the substring of C of length 3 starting at position 4”, we get a string of cost 11+3h that gives us back s if we replace all pointers with the strings they represent.

Reduction: The paper by Storer that we’ve been using for the last few problems also has this reduction.  I think this is his “CPM” problem.  He uses the “K-Vertex Cover” problem, in the case where K=1.   Recall in that case we are looking for a set of edges E’ such that each edge in E shares a vertex with an edge in E’. We start with a graph, which is an instance of this problem.  The alphabet we build will have:

  • A special symbol $
  • One symbol for each vertex and edge in G

For each edge ei =  (vj, vk), define the substring Ei = $vj$vk$

Our s = the concatenation of eiEi for all edges i (So the symbol for the edge followed by the substring of the edge).  B = J + |S| – |E|.

The idea (and I’ll admit I have a hard time following this paper) is that if G has a 1-VC, then we have J edges that form a set E’ such that each edge in E is adjacent to something in E’.  This means that we can overlap those edges and make pointers out of the Ei strings to save enough characters.

Difficulty: 9.  I’m sure this can be explained better, but I really have a hard time following it.

K-Vertex Cover

This week we do a variant of Vertex Cover that I don’t think I’ve seen before, but we’ll use for the next problem.

After this week, I’m going to be traveling for three weeks.  So the next post will be the week of August 14.

The problem: K-Vertex Cover.  This problem is not in the appendix.

The description: Given a graph G=(V,E), and integers K and J.  Does G contain a set of J or less paths, where each path contains K or less edges, and each edge in E is incident on at least one of these paths?

Example:  Here is a graph:

A K-Vertex Cover where K = 0 is just a regular Vertex Cover:

With K=1, this is “Find a subset E’ of E where each edge in E shares an endpoint with something in E'”:

With K=2, we are finding paths of 2 (or less) edges where each edge in E is adjacent to something in a path (I colored each path a different color to make them easier to tell apart):

Notice that K is set as a parameter of the problem.

Reduction: The report by Storer that had the previous reduction (and which will have the next one) introduces this problem.  The reduction is from Vertex Cover.  I guess we could just take the initial VC instance, set K=0, and be done, but I like this construction because it shows how to make 1-VC, 2-VC, and so on NP-Complete.  (In other words, if K is fixed outside of the problem, you need this reduction).  Interestingly, a |V|-vertex cover is trivial (it’s asking if the graph has J or less connected components).

Given our VC instance G=(V,E), with bound J (instead of K, just because K here means something different and the J for our problem will be the same J for the VC instance), build a new graph G’=(V’, E’):

V’ starts with V, and adds two new vertices for each K and each J (so 2*K*J new vertices) labeled x1,1 through xj,k and y1,1 through yj,k.  Notice that since J and K should be ≤ |V| (or the answer is trivial), this only adds a polynomial number of vertices.

E’ starts with E, adds an edge from each vertex in V to all x vertices labeled xj,1, an edge from each xi,j to its corresponding yi,j, and a vertex from xi,j to the next xi,k+1

The idea is that since each x vertex needs to connect to a y vertex, we will need to include a path that goes through each x vertex in our k-cover.  Since there are J paths of length K, and each vertex in v connected to all xj,1, what happens is that each of the J vertices chosen in the VC of G “chooses” a different path of x vertices.  So we can cover all of the vertices in G’ with J paths if and only if we can cover all of the vertices in G with J vertices.

Difficulty: 5.  This is a pretty cool reduction.  I like the idea of adding K *J copies of the vertices, because while you can do that for a graph problem (since K and J need to be < |V| realistically), you can’t do that for other kinds of problems.  For example, trying to create an element for each value from 1 to K in the Sum of Subsets problem won’t give you a polynomial reduction.

External Macro Data Compression

These next problems are related “macro compression” problems.

The problem: External Macro Data Compression.  This is problem SR22 in the appendix.

The description: Given a string s over some alphabet Σ, a “pointer cost” h and a bound B, can we find two strings D and C over the alphabet of Σ augmented with some number (< |s|) of “pointer characters” pi such that:

  • |D| + |C| + (h-1)* (# of p characters in D and C) ≤ B
  • We can generate s by replacing pointer characters in C with their “meaning” in D.

Example: The concept they’re trying to get at isn’t that hard, it’s just hard to explain using mathematical language.  Suppose s = “ABCDEFABCGABCDEF”

Then we can define D to be “ABCDEF” and C to be “pqpGpq”.  The total size of this is 6 for D, plus 6 for C.  There are 5 pointers, so our total cost is 6+6+5h.

The idea is that the characters p and q are “pointers” that refer to substrings in D (p refers to “ABC” and q refers to “DEF”).  By replacing those pointers with what they “mean” in C, we can get back s.

A tricky part of this is that you are allowed to have substrings overlap.  So if s was “ABCDBCD” we could define D to be “ABCD” and C to be “pq” with p meaning “ABCD” and q meaning “BCD”.  Now our total cost is 4 for D, 2 for C, and 2 pointers, so 4+2+h.

Reduction:  The reduction (and the one for SR23) comes from a technical report by Storer, which is pretty dense.  I think we’re looking at “Theorem 2” in the paper, which is from VC<=3.

The alphabet that will be built from the VC instance has a lot of parts:

  • A special symbol $
  • 3 symbols vi, ai, and bi for each vertex vi in the graph
  • 4 more symbols fi,1,1 through fi,2,2 for each vertex vi in the graph
  • one symbol di for each edge ej in the graph.
  • 2 symbols c1 and c2 (there is actually one c symbol for each value of h.  We’ll assume h=2 here)
  • 3 symbols g1,1 g1,2 and g2,1 (This is also based on h.  So really you’d go from g1,1 through gh-1,2 and add gh,1)

The string s will also be built from parts (a lot of these are based on h has well, again, I’m fixing h=2 to keep it simpler)

  • Vi,l = ai$vi
  • Vi,2 = vi$bi
  • For each edge ei= (vj, vk), Ei = $vj$vj$
  • We also have Z1 = (c1)3 (so 3 copies of c1) and Z2 = (c2)3

s is:

  • Eidi  concatenated for each edge, followed by
  • Vi,jfi,j,k concatenated over each vertex and each possible f symbol, followed by
  • Z1g1,1Z1g1,2, followed by
  • Z2g2,1Z2

K’ = |s| + K – (7/2)|V|.

The basic idea from here is that if G has a VC, we can compress s by making pointers for (among other things) the combination of Vi,1fi,1,2Vi,2 and the combination of Vi,2fi,2,2Vi+1,1  where vertex vi is in the VC.  This lets us use the pointers both for the part of the string with the V’s and f’s, but also for the $vi$ strings in the E components, saving space.

In the other direction, if we have a compressed string of size K’, he shows that means that you have compress a string like the above and the overlaps show you the vertices in the cover.

Difficulty: 8.  I think to really get what’s happening, you need to actually build an example on a graph.  But I think the idea of building the sets so that that is the overlap you’re looking for is something that can be gotten eventually.

String-To-String Correction

This one reminds me of an old puzzle problem.

The problem: String-To-String Correction.  This is problem SR20 in the appendix.

The description: Given two strings x, and y over a finite alphabet ∑, and an integer K.  Can we start from x and derive y in K or fewer steps, where each step is either deleting a symbol or interchanging a symbol.

Example: The puzzles I’m thinking of are the ones that say “Can you go from HELLO to OLE in 6 steps?”

  • ELLO
  • ELO
  • EOL
  • OEL
  • OLE

The reason this is a hard problem is when you have repeated symbols, and need to “choose” which ones to delete.  A slightly harder example is to go from ABABABABBB to BAAB.  Now which A’s and B’s you choose to delete have an effect on the total number of moves.

The paper by Wagner that has the reduction makes the point that if you replace “Delete” with “Insert”, you get basically the same problem but go from y to x instead of from x to y.  The paper gives other variants that have polynomial solutions (allowing insertions and deletions and changes of a character, whether or not you allow interchange).

Reduction: Wagner uses Set Covering.  So we’re given a collection of sets all subsets of some set W, and an integer K.

Define t = |W| and r = t2.  Pick 3 symbols Q,R, and S that are not in W.  The string x  will be:


..concatenated together each time for each ci.  (So the copy that has c1 is first, then another copy with c2, and so on.

The string y will be n copies of the substring RQr, followed by w, followed by n copies of the substring Sr+1

The K’ for this problem is (K+1)*r-1 + 2t(r-1) + n(n-1)(r+1)2/2 + d.

d is r*n+|c1…c2| – |W|.

From here, Wagner explains that the way the correction problem works is by deciding which copies of the R, Qr and Sr+1  sets in x match to the corresponding ones in y.  By counting all of the operations that “have” to be done (by deleting or moving characters), he shows that the a correction problem that has weight of K’ has to cross over certain ci sets to make a covering, and that there have to be K or less of those sets.

Difficulty: 8.  The construction is pretty crazy.  You can eventually see where the components come from, but I can’t imagine how he came up with these things (especially the x and y strings) in the first place.

Two-Dimensional Consecutive Sets

This problem is harder to explain than the last couple.  Let’s see how it goes.

The problem: Two-Dimensional Consecutive Sets.  This is problem SR19 in the appendix.

The description: Given a finite alphabet ∑, and a collection C = {∑1..∑n}  of subsets of ∑, can we partition ∑ into some some number of sets X1..Xk such that:

  • Each Xi has at most one element in each set in C,
  • For each ∑j in C, there is an “index” l(j) (a number ≥ 0) where ∑j is in  Xl(j) ∪ Xl(j)+1  ∪ … ∪ Xl(j)+m-1, where m =|∑j|

Example: The “index” stuff is basically there to say “you can make each set in C by taking one element from a consecutive X sets”.  So, if C = {{a,b,c}, {b,e}, {b,c,e}, and {b,c} over the alphabet {a,b,c,d,e} then we can make X = {{a,e}, {b}, {c,d}}

Then we can get each set in C by taking one element from consecutive sets in X.

An example of a C that can’t be realized would be: {{a,b}, {b,c}, {b,d}, {a,b,c,d}}

The element b needs to be in a set X that is adjacency to the sets that contain a,c, and d.  Since the set with b can only be adjacent to at most 2 other sets in X, that means that some of the elements in a,c, and d need to be in the same set.  But the set {a,b,c,d} prevents that from happening.

Reduction: G&J’s reference is to a paper by Lipski, which uses Graph Coloring.  So we’re given a graph and an integer K.  We’re going to build an alphabet that has one symbol for each vertex in V, K+2 “b” elements (numbered from 0 to k+1), and K-2 “a” elements (numbered from 1 to k-2) for each edge.  (There is only one copy of the “b” elements.  There is one set of the “a” elements for each edge).  The sets in C are:

  • One set containing all of the “b” elements.
  • One set for each consecutive pair of “b” elements (so {b0, b1} all the way through {bk, bk+1})
  • For each edge (v,w), a set containing the vertex symbols for v and w, all k-2 “a” symbols for that edge, b0, and bk+1

If G has a K-Coloring, we can split ∑ into k+2 pieces, numbered X0 to Xk+1  (Notice that because of the set with all of the “b” elements in C, we need to have at least that many sets in X).  Each Xi gets its corresponding “b” element.  For each edge (v,w), we know that vertices v and w each have different colors (numbered from 1-K).  We put v’s symbol in the X set numbered with v’s color, and we put w’s symbol in the X set numbered with w’s color.  We put the remaining “a” symbols for that edge in the other k-2 X sets between 1 and N.  This creates a two-dimensional partition  of all of the sets in C.

If we have a two-dimensional partition of C, then list them out in order of how they sort the “b” elements (So the set with b0 is X0 and so on).  It’s possible that we have to do a reverse ordering (the set with b0 is Xk+1 and the set with bk+1 is X0), but we can treat that as basically the same.  Since the sets in C that were based on edges contain b0 and bk+1, the other elements in that set need to fit in sets X1 through Xk.  So the partition has to have exactly K+2 sets, and the vertices all fall in sets X1 through Xk with no two vertices sharing an edge in the same set.  This gives a K-coloring of the vertices in the graph.

Difficulty: 7.  It sort of makes sense that the “two-dimensional” problem needs us to create a “two-dimensional” set of elements (the “a” elements), but I don’t think I would expect a student to be able to come up with these elements and these steps without a lot of help.

Consecutive Sets

This one moves slightly away from the matrix stuff we’ve been doing, but results in a pretty cool reduction.

The problem: Consecutive Sets.  This is problem SR18 in the appendix.

The description: Given a collection C of subsets of a finite alphabet Σ, and a positive integer K, can I find a string w from ∑* with K symbols or less, such that each element in C has its characters occur sequentially in w?

Example: Let ∑ = {a,b,c,d,e} and C = {{a,b,c}, {a,d,e}, {b,e}, {c,d}}

Then I can choose w to be dcbadeb, which is length 7, which is the best I could find. Notice how positions 2-4 contain {a,b,c}, positions 4-6 contain {a,d,e}, positions 6-7 contain {b,e}, and positions 1-2 contain {c,d}.

Reduction: This is “Problem 4” in the paper by Kou from last week.  He reduces from Hamiltonian Path.  So we start with a graph G=(V,E).  Our alphabet Σ (he uses “R” in the paper) has one symbol for each vertex in V and one symbol for each edge in E.  Our C set (he uses F in the paper) will have one set for each vertex.  The symbols in the set for a vertex i will consist of the edge symbols for the edges incident on i, and the vertex symbol for i itself.  K will be 2*|E|+1.

Notice that no two sets in C are subsets of each other (because they each hold a separate symbol from V), and that the intersection of 2 sets from C is a single edge (the edge connecting the two vertices corresponding to the sets), if that edge exists, and the intersection is empty otherwise.  A string w that we make will be based on a sequence of vertices (and their C sets).  Since each edge appears in two C sets, the “initial” size of w is 2|E| + |V|, if we have no overlap between C sets in w.  The question is how small can it get?

Since each C set can have at most one symbol of overlap with another C set, and that only happens if an edge exists between the vertices that define the C sets, the only way to create an overlap to reduce the size of w is to order the vertices such that there are edges between consecutive pairs of vertices.  G has a Hamiltonian Path if and only if we can order the vertices in such a way that there is an edge bewteen each part of consecutive vertices.  In w, that means we can overlap 1 symbol from each pair of consecutive C sets.  Since there are |V|-1 pairs of vertices in a path, the minimum size of w is 2|E| + |V| – (|V|-1) = 2|E| + 1 = K.

Difficulty: 5.  I think this is a neat reduction that students can get, though it might take a bit of work to make sure they understand the problem.

Consecutive Block Minimization

These next two problems are very similar to the “consecutive ones” problems we’ve been looking at lately.

The problem: Consecutive Block Minimization.  This is problem SR17 in the appendix.

The description: Given an m x n matrix A consisting of 0’s and 1’s and a positive integer K, can we permute the columns of A so that the permuted matrix (B) has K or less blocks of consecutive 1’s.  G&J define a “block” as an entry bij =1 where bi,j+1 = 0 or j = n.  So a “block” is one-dimensional, going across a row.

Example: Here’s a matrix:

1 0 1 0 1
1 1 1 0 1
0 0 1 1 0
0 1 1 0 0
1 1 0 0 1

Right now, this has 9 blocks (3 on the first row, 2 on the second and fifth, 1 on the third and fourth).

But we can permute the columns to get:

0 1 1 1 0
1 1 1 1 0
0 0 0 1 1
1 0 0 1 0
1 1 1 0 0

..for a total of 6 blocks.

Reduction: The paper by Kou has several reductions in it, we’ll be using this paper next week too.  This problem is “Problem 3” in the paper.  The problem uses Hamiltonian Path, but with costs for edges, so it’s really more like “Traveling Salesman on Paths”.   His “Problem 2” takes an instance of regular Hamiltonian Path and shows that this “Traveling Salesman on Paths” problem is NP-Complete on “cost graphs”.

Kou defines a “cost graph” of a matrix as a complete graph that has one vertex for each row and an edge cost of edge (i,j) as  the number of columns in A that have a 1 in row i and a 0 in row j.  He proves in the paper that a Hamiltonian Path on the cost graph has cost k iff there are k+c blocks of 1’s in the matrix, where c is the number of 1’s in the last row in the matrix.

So, the full reduction takes an instance of HP, creates an incidence matrix (as we defined in the Consecutive Ones Matrix Augmentation problem),  builds a cost graph of that, and uses that cost graph as the input to the Consecutive Block minimization problem.

Difficulty: 7. The cost graph construction is pretty hard to grasp, and you need to prove some extra lemmas along the way to show that the transformation works.

Consecutive Ones Matrix Partition

This is the problem I skipped over last week to get the two problems from Booth’s dissertation done together.

The problem: Consecutive Ones Matrix Partition.  This is problem SR15 in the appendix.

The description: Given an mxn matrix A consisting of 0’s and 1’s, can we partition the rows of A into two groups m1 and m2 such that the matrices A1 = m1x n and A2 = m2x n both have the consecutive ones property?

Example: Here’s the matrix that does not have the consecutive ones property.

0 1 1 1 1
1 1 1 0 1
0 0 1 1 0
0 1 1 0 0
1 1 0 0 0

But if we break it into two matrices, for example:

0 1 1 1 1
1 1 1 0 1
0 1 1 0 0


0 0 1 1 0
1 1 0 0 0

The two matrices have the consecutive ones property (the second matrix already has it, the top matrix has it if you exchange the 4th and 5th columns.)

Reduction: G&J point to a “private communication”, but I found a technical report by Lipski that explains the reduction.  The reduction is from Hamiltonian Path on cubic graphs (each vertex has degree at most 3).  So we start with a graph G=(V,E) where each vertex is degree 3 or less.  We construct a new graph G’=(V’,E’) where G’ adds new vertices and edges so that each vertex in V has degree 4.  Each new vertex has degree 1 (and so the new edges we add to V always connect to distinct new vertices).  Take the incidence matrix of this graph (as defined in the Consecutive Ones Submatrix problem) and use it as A.

If G has a Hamiltonian Path, then we can split A into two parts: the vertices in the path (thus, all of V) and everything else.  The Hamiltonian Path has the consecutive ones property, just like we saw in the Consecutive Ones Submatrix problem.  The other rows just map to disjoint paths of length 2 (two degree one edges connected by some vertex in V) so will be able to be rearranged.

If A has a partition that works, then choose some vertex v in V.  Suppose it ends up in A1.  We know that this vertex has degree 4.  We also know that to satisfy the consecutive ones property, v needs to be adjacent to exactly 2 things in A1 and 2 things in A2 for the consecutive ones property to work out.  The vertices in A1 need to form a chain of vertices all in V to make the consecutive ones property hold, so will form a Hamiltonian Path.

Difficulty: 6.  The paper glosses over a lot of detail, and I did to.  For example, the paper says “Without loss of generality we may assume that V ∈ M1” (the paper’s version of A1.  I don’t think it’s obvious at all that the entirety of V will end up all in one partition or the other.  I think it’s true, but I think it needs more work from a student to explain why it works.