Category Archives: Appendix: Storage and Retrieval

External Macro Data Compression

These next problems are related “macro compression” problems.

The problem: External Macro Data Compression.  This is problem SR22 in the appendix.

The description: Given a string s over some alphabet Σ, a “pointer cost” h and a bound B, can we find two strings D and C over the alphabet of Σ augmented with some number (< |s|) of “pointer characters pi such that:

  • |D| + |C| + (h-1)* (# of p characters in D and C) ≤ B
  • We can generate s by replacing pointer characters in C with their “meaning” in D.

Example: The concept they’re trying to get at isn’t that hard, it’s just hard to explain using mathematical language.  Suppose s = “ABCDEFABCGABCDEF”

Then we can define D to be “ABCDEF” and C to be “pqpGpq”.  The total size of this is 6 for D, plus 6 for C.  There are 5 pointers, so our total cost is 6+6+5h.

The idea is that the characters p and q are “pointers” that refer to substrings in D (p refers to “ABC” and q refers to “DEF”).  By replacing those pointers with what they “mean” in C, we can get back s.

A tricky part of this is that you are allowed to have substrings overlap.  So if s was “ABCDBCD” we could define D to be “ABCD” and C to be “pq” with p meaning “ABCD” and q meaning “BCD”.  Now our total cost is 4 for D, 2 for C, and 2 pointers, so 4+2+h.

Reduction:  The reduction (and the one for SR23) comes from a technical report by Storer, which is pretty dense.  I think we’re looking at “Theorem 2” in the paper, which is from VC<=3.

The alphabet that will be built from the VC instance has a lot of parts:

  • A special symbol $
  • 3 symbols vi, ai, and bi for each vertex vi in the graph
  • 4 more symbols fi,1,1 through fi,2,2 for each vertex vi in the graph
  • one symbol di for each edge ej in the graph.
  • 2 symbols c1 and c2 (there is actually one c symbol for each value of h.  We’ll assume h=2 here)
  • 3 symbols g1,1 g1,2 and g2,1 (This is also based on h.  So really you’d go from g1,1 through gh-1,2 and add gh,1)

The string s will also be built from parts (a lot of these are based on h has well, again, I’m fixing h=2 to keep it simpler)

  • Vi,l = ai$vi
  • Vi,2 = vi$bi
  • For each edge ei= (vj, vk), Ei = $vj$vj$
  • We also have Z1 = (c1)3 (so 3 copies of c1) and Z2 = (c2)3

s is:

  • Eidi  concatenated for each edge, followed by
  • Vi,jfi,j,k concatenated over each vertex and each possible f symbol, followed by
  • Z1g1,1Z1g1,2, followed by
  • Z2g2,1Z2

K’ = |s| + K – (7/2)|V|.

The basic idea from here is that if G has a VC, we can compress s by making pointers for (among other things) the combination of Vi,1fi,1,2Vi,2 and the combination of Vi,2fi,2,2Vi+1,1  where vertex vi is in the VC.  This lets us use the pointers both for the part of the string with the V’s and f’s, but also for the $vi$ strings in the E components, saving space.

In the other direction, if we have a compressed string of size K’, he shows that means that you have compress a string like the above and the overlaps show you the vertices in the cover.

Difficulty: 8.  I think to really get what’s happening, you need to actually build an example on a graph.  But I think the idea of building the sets so that that is the overlap you’re looking for is something that can be gotten eventually.

String-To-String Correction

This one reminds me of an old puzzle problem.

The problem: String-To-String Correction.  This is problem SR20 in the appendix.

The description: Given two strings x, and y over a finite alphabet ∑, and an integer K.  Can we start from x and derive y in K or fewer steps, where each step is either deleting a symbol or interchanging a symbol.

Example: The puzzles I’m thinking of are the ones that say “Can you go from HELLO to OLE in 6 steps?”

  • ELLO
  • ELO
  • EOL
  • OEL
  • OLE

The reason this is a hard problem is when you have repeated symbols, and need to “choose” which ones to delete.  A slightly harder example is to go from ABABABABBB to BAAB.  Now which A’s and B’s you choose to delete have an effect on the total number of moves.

The paper by Wagner that has the reduction makes the point that if you replace “Delete” with “Insert”, you get basically the same problem but go from y to x instead of from x to y.  The paper gives other variants that have polynomial solutions (allowing insertions and deletions and changes of a character, whether or not you allow interchange).

Reduction: Wagner uses Set Covering.  So we’re given a collection of sets all subsets of some set W, and an integer K.

Define t = |W| and r = t2.  Pick 3 symbols Q,R, and S that are not in W.  The string x  will be:


..concatenated together each time for each ci.  (So the copy that has c1 is first, then another copy with c2, and so on.

The string y will be n copies of the substring RQr, followed by w, followed by n copies of the substring Sr+1

The K’ for this problem is (K+1)*r-1 + 2t(r-1) + n(n-1)(r+1)2/2 + d.

d is r*n+|c1…c2| – |W|.

From here, Wagner explains that the way the correction problem works is by deciding which copies of the R, Qr and Sr+1  sets in x match to the corresponding ones in y.  By counting all of the operations that “have” to be done (by deleting or moving characters), he shows that the a correction problem that has weight of K’ has to cross over certain ci sets to make a covering, and that there have to be K or less of those sets.

Difficulty: 8.  The construction is pretty crazy.  You can eventually see where the components come from, but I can’t imagine how he came up with these things (especially the x and y strings) in the first place.

Two-Dimensional Consecutive Sets

This problem is harder to explain than the last couple.  Let’s see how it goes.

The problem: Two-Dimensional Consecutive Sets.  This is problem SR19 in the appendix.

The description: Given a finite alphabet ∑, and a collection C = {∑1..∑n}  of subsets of ∑, can we partition ∑ into some some number of sets X1..Xk such that:

  • Each Xi has at most one element in each set in C,
  • For each ∑j in C, there is an “index” l(j) (a number ≥ 0) where ∑j is in  Xl(j) ∪ Xl(j)+1  ∪ … ∪ Xl(j)+m-1, where m =|∑j|

Example: The “index” stuff is basically there to say “you can make each set in C by taking one element from a consecutive X sets”.  So, if C = {{a,b,c}, {b,e}, {b,c,e}, and {b,c} over the alphabet {a,b,c,d,e} then we can make X = {{a,e}, {b}, {c,d}}

Then we can get each set in C by taking one element from consecutive sets in X.

An example of a C that can’t be realized would be: {{a,b}, {b,c}, {b,d}, {a,b,c,d}}

The element b needs to be in a set X that is adjacency to the sets that contain a,c, and d.  Since the set with b can only be adjacent to at most 2 other sets in X, that means that some of the elements in a,c, and d need to be in the same set.  But the set {a,b,c,d} prevents that from happening.

Reduction: G&J’s reference is to a paper by Lipski, which uses Graph Coloring.  So we’re given a graph and an integer K.  We’re going to build an alphabet that has one symbol for each vertex in V, K+2 “b” elements (numbered from 0 to k+1), and K-2 “a” elements (numbered from 1 to k-2) for each edge.  (There is only one copy of the “b” elements.  There is one set of the “a” elements for each edge).  The sets in C are:

  • One set containing all of the “b” elements.
  • One set for each consecutive pair of “b” elements (so {b0, b1} all the way through {bk, bk+1})
  • For each edge (v,w), a set containing the vertex symbols for v and w, all k-2 “a” symbols for that edge, b0, and bk+1

If G has a K-Coloring, we can split ∑ into k+2 pieces, numbered X0 to Xk+1  (Notice that because of the set with all of the “b” elements in C, we need to have at least that many sets in X).  Each Xi gets its corresponding “b” element.  For each edge (v,w), we know that vertices v and w each have different colors (numbered from 1-K).  We put v’s symbol in the X set numbered with v’s color, and we put w’s symbol in the X set numbered with w’s color.  We put the remaining “a” symbols for that edge in the other k-2 X sets between 1 and N.  This creates a two-dimensional partition  of all of the sets in C.

If we have a two-dimensional partition of C, then list them out in order of how they sort the “b” elements (So the set with b0 is X0 and so on).  It’s possible that we have to do a reverse ordering (the set with b0 is Xk+1 and the set with bk+1 is X0), but we can treat that as basically the same.  Since the sets in C that were based on edges contain b0 and bk+1, the other elements in that set need to fit in sets X1 through Xk.  So the partition has to have exactly K+2 sets, and the vertices all fall in sets X1 through Xk with no two vertices sharing an edge in the same set.  This gives a K-coloring of the vertices in the graph.

Difficulty: 7.  It sort of makes sense that the “two-dimensional” problem needs us to create a “two-dimensional” set of elements (the “a” elements), but I don’t think I would expect a student to be able to come up with these elements and these steps without a lot of help.

Consecutive Sets

This one moves slightly away from the matrix stuff we’ve been doing, but results in a pretty cool reduction.

The problem: Consecutive Sets.  This is problem SR18 in the appendix.

The description: Given a collection C of subsets of a finite alphabet Σ, and a positive integer K, can I find a string w from ∑* with K symbols or less, such that each element in C has its characters occur sequentially in w?

Example: Let ∑ = {a,b,c,d,e} and C = {{a,b,c}, {a,d,e}, {b,e}, {c,d}}

Then I can choose w to be dcbadeb, which is length 7, which is the best I could find. Notice how positions 2-4 contain {a,b,c}, positions 4-6 contain {a,d,e}, positions 6-7 contain {b,e}, and positions 1-2 contain {c,d}.

Reduction: This is “Problem 4” in the paper by Kou from last week.  He reduces from Hamiltonian Path.  So we start with a graph G=(V,E).  Our alphabet Σ (he uses “R” in the paper) has one symbol for each vertex in V and one symbol for each edge in E.  Our C set (he uses F in the paper) will have one set for each vertex.  The symbols in the set for a vertex i will consist of the edge symbols for the edges incident on i, and the vertex symbol for i itself.  K will be 2*|E|+1.

Notice that no two sets in C are subsets of each other (because they each hold a separate symbol from V), and that the intersection of 2 sets from C is a single edge (the edge connecting the two vertices corresponding to the sets), if that edge exists, and the intersection is empty otherwise.  A string w that we make will be based on a sequence of vertices (and their C sets).  Since each edge appears in two C sets, the “initial” size of w is 2|E| + |V|, if we have no overlap between C sets in w.  The question is how small can it get?

Since each C set can have at most one symbol of overlap with another C set, and that only happens if an edge exists between the vertices that define the C sets, the only way to create an overlap to reduce the size of w is to order the vertices such that there are edges between consecutive pairs of vertices.  G has a Hamiltonian Path if and only if we can order the vertices in such a way that there is an edge bewteen each part of consecutive vertices.  In w, that means we can overlap 1 symbol from each pair of consecutive C sets.  Since there are |V|-1 pairs of vertices in a path, the minimum size of w is 2|E| + |V| – (|V|-1) = 2|E| + 1 = K.

Difficulty: 5.  I think this is a neat reduction that students can get, though it might take a bit of work to make sure they understand the problem.

Consecutive Ones Matrix Partition

This is the problem I skipped over last week to get the two problems from Booth’s dissertation done together.

The problem: Consecutive Ones Matrix Partition.  This is problem SR15 in the appendix.

The description: Given an mxn matrix A consisting of 0’s and 1’s, can we partition the rows of A into two groups m1 and m2 such that the matrices A1 = m1x n and A2 = m2x n both have the consecutive ones property?

Example: Here’s the matrix that does not have the consecutive ones property.

0 1 1 1 1
1 1 1 0 1
0 0 1 1 0
0 1 1 0 0
1 1 0 0 0

But if we break it into two matrices, for example:

0 1 1 1 1
1 1 1 0 1
0 1 1 0 0


0 0 1 1 0
1 1 0 0 0

The two matrices have the consecutive ones property (the second matrix already has it, the top matrix has it if you exchange the 4th and 5th columns.)

Reduction: G&J point to a “private communication”, but I found a technical report by Lipski that explains the reduction.  The reduction is from Hamiltonian Path on cubic graphs (each vertex has degree at most 3).  So we start with a graph G=(V,E) where each vertex is degree 3 or less.  We construct a new graph G’=(V’,E’) where G’ adds new vertices and edges so that each vertex in V has degree 4.  Each new vertex has degree 1 (and so the new edges we add to V always connect to distinct new vertices).  Take the incidence matrix of this graph (as defined in the Consecutive Ones Submatrix problem) and use it as A.

If G has a Hamiltonian Path, then we can split A into two parts: the vertices in the path (thus, all of V) and everything else.  The Hamiltonian Path has the consecutive ones property, just like we saw in the Consecutive Ones Submatrix problem.  The other rows just map to disjoint paths of length 2 (two degree one edges connected by some vertex in V) so will be able to be rearranged.

If A has a partition that works, then choose some vertex v in V.  Suppose it ends up in A1.  We know that this vertex has degree 4.  We also know that to satisfy the consecutive ones property, v needs to be adjacent to exactly 2 things in A1 and 2 things in A2 for the consecutive ones property to work out.  The vertices in A1 need to form a chain of vertices all in V to make the consecutive ones property hold, so will form a Hamiltonian Path.

Difficulty: 6.  The paper glosses over a lot of detail, and I did to.  For example, the paper says “Without loss of generality we may assume that V ∈ M1” (the paper’s version of A1.  I don’t think it’s obvious at all that the entirety of V will end up all in one partition or the other.  I think it’s true, but I think it needs more work from a student to explain why it works.

Consecutive Ones Matrix Augmentation

While I had the dissertation I needed for the last problem, and before I had to send it back (without making copies), I skipped ahead to another problem that used it as a reference.

The problem: Consecutive Ones Matrix Augmentation.  This is problem SR16 in the appendix.

The description: Given an mxn matrix of 0’s and 1’s and a positive integer K, can we change K or fewer 0’s into 1’s to give the matrix the consecutive ones property.  See the Consecutive Ones Submatrix problem for more detail on this property.

Example: Here is the matrix that did not have the property from last week:

0 1 1 1 1
1 1 1 0 1
0 0 1 1 0
0 1 1 0 0
1 1 0 0 0

Recall that we’re looking for a way to  permute the columns so that there are no gaps in the 1’s in each row.  The easy solution here is to add a 1 in the second row (making it all 1’s).  In that case, the matrix has the consecutive ones property without needing any column permutations.

Reduction: Again, it is found in the dissertation by Booth, but this time uses Optimal Linear Arrangement.  Recall in that problem we were given a graph and needed to find an ordering of vertices to minimize the “score” of the differences of the ordering values.

The corresponding idea here is to start with the graph and build an incidence matrix, just like in the Consecutive Ones Submatrix problem.  We notice that any permutation of the rows (in the dissertation, you permute the rows to find consecutive ones in each column) corresponds to a permutation of the vertices in the OLA problem.  If we took a permutation of the vertices, each edge (column) of the incidence matrix has exactly two 1’s in it.  There may or may not be any 0’s between those ones, but if there are, we can convert those 0’s into 1’s (this is the augmentation).

The way these problems relate is that if an edge exists between two vertices, vi and vj, located at positions i and j in the final permutation, that edge adds j-i to the cost (assuming j > i).  In our matrix, if we permute the row corresponding to vertex vi to row i, and permute the row corresponding to vertex vj to row j, then there are j-i-1 o’s between the two rows.  So we could set them all to 1, and that column would have the consecutive ones property.

Thus, a solution of cost K to the OLA problem corresponds to a solution to the Consecutive Ones Matrix Augmentation problem of cost K-|E|, since there are |E| columns in the incidence matrix, and each one differs by 1 from the OLA cost of the edge.

Difficulty: 6, mainly because of the incidence matrix/adjacency matrix difficulty I talked about last time, but also because OLA is not easy to get.  If you’ve already shown them incidence matrices (for example, because you’ve gone over the Consecutive Ones Submatrix problem), then this is probably a 4.

Protected: Consecutive Ones Submatrix

This content is password protected. To view it please enter your password below:

Protected: Sparse Matrix Compression

This content is password protected. To view it please enter your password below:

Protected: Hitting String

This content is password protected. To view it please enter your password below: