Tag Archives: Shortest Common Superstring

Sparse Matrix Compression

Here’s the problem I’ve been dreading for months.  I haven’t been able to track down the reference, and I haven’t been able to come up with a full reduction on my own.  I’ve decided that I fundamentally don’t understand the way G&J are explaining the problem and have come up with an alternate problem statement that I think meets the same goals.

The problem: Sparse Matrix Compression.  This is problem SR13 in the appendix.

G&J’s description: Given an mxn matrix A where each entry in the matrix is either 0 or 1, and an integer K ≤ m*n.  Can we find a sequence B of n+K integers, each of which is between 0 and m (inclusive), and a function s mapping the numbers from 1 to m to the numbers 1 to K, such that aij = 1 if and only if bs(i)+j-1 = i?

Example: So what does this have to do with compression, you may be asking? The idea is that the b sequence will tell us how to find runs of 1’s in various rows, and the s function tells us where each row’s sequence begins.

So, suppose this is our matrix:

1 0 1 0
0 0 1 1
0 0 0 1
1 1 0 0

The first row implies that the B sequence needs to have 1x1x in it (the x’s can be anything that’s not a 1)

The second row implies that the B sequence needs xx22 in it.

The third row implies that the B sequence needs xxx3

The fourth row implies that the B sequence needs 44xx

The shortest way I can make all of that happen is 4413122X.  In this case, s(1) = 2,  (the sequence starting at position 2: 4131 is the sequence for the first row), s(2) = 4, s(3)=1, s(4) = 1.  The X at the end can be any number, but needs to be there because the sequence needs to be of size n+K, and  the s values need to map from 1-K, and we have s(2) = 4.

(Non-)Reduction: G&J’s reference is to an “unpublished manuscript”, and they say to use graph 3- coloring, and to set K=3 (presumably related to the number of colors).  I’ve been thinking about this for months, and I think I just misunderstand the problem.   I have a real problem trying to understand the use of the K values- setting K=3 means that each row can only map to 1,2, or 3 via the s function.  And if two (or more!) rows have the same s value that means they can have no 1’s in common.  So since there are only 3 s values, if the graph is colorable, that means every row in the matrix needs to map into 3 groups of rows, where for each row they have no 1’s in common.

Which seems like the start of a good idea (the three groups are the three colors of vertices, two rows go into the same group if they can be colored the same and don’t share an edge), except that then you have to come up with a way to eventually get a number in the sequence for each 1 in each row and you need to worry about the fact that the groups that map to the other colors are starting right next to where your group starts (since every s(i) has to map to 1,2,or 3, they all have to start at one of the first 3 positions in the sequence, so for example b4 is the 4th thing in sequence 1, the 3rd thing in sequence 2, and the third thing in sequence 3.  So there needs to be at most one 1 out of all of those overlapping rows), and I couldn’t make this all work out.

Instead, I started thinking about how this problem looks an awful lot like Shortest Common Superstring– We need a common string that holds all of the strings corresponding to each row.

So, let me rewrite the problem as follows: Can we come up with a sequence B of 0’s and 1’s, and a function s such that bs(i)+j-1 = aij? Using this formulation, the best solution I could come up with for my example above was 1100011010.  s(1) = 7, s(2) = 4, s(3) = 3, s(4) = 1.  But on more sparse (or more similar) matrices, the string can get much shorter: If the matrix was 4 copies of the same row 0011, then 0011 can be the result string and s(i) = 1 for all i.  The 4×4 identity matrix can have the b sequence 0001000 and s(1) = 4, s(2) = 3, s(3)=2, s(1) = 1

This is “equivalent” to G&J’s description in the sense that it captures the same desires of the original problem- take a two-dimensional matrix that consists of 0’s and 1’s, and generate a (preferably short) one-dimensional representation with an index function that tells you where each row starts.  In this new formulation, the s function points to the start of each string, and now aij = 1 if and only if the corresponding element in the sequence is 1 (instead of i).  What isn’t so obvious to make equivalent is the K in the original problem (the number of “starting positions” you need to add to N to make the sequence) and the K here (the size of the resulting sequence).  I think it has to do with the structure of the underlying pattern of 0’s and 1’s in each row.  If there is a way to make those K’s translate in polynomial time, then this formulation is truly equivalent to G&J’s.

So now what we need is an instance of SCS where all strings are in the alphabet {0,1} (the problem is still NP-Complete in this case), make them be the rows of our matrix, and set the K’ of the Sparse Matrix Compression problem to be the K of the SCS problem, and it should work out.  I’m not 100% convinced this works, but it seems like it should.

Difficulty: 6.  Mainly because it’s so hard for me to understand the problem, but also seeing how it maps to SCS is not trivial.  But the problem as written by G&J is still unsolved.  Hopefully someone out there can explain how to solve that problem.

Protected: Shortest Common Superstring

This content is password protected. To view it please enter your password below: