Appendix: Storage and Retrieval

Boyce-Codd Normal Form Violation

This week’s and next week’s problems use basically the same construction in their reduction.  We’re going to do them in reverse order, because that’s the order they’re presented in the paper.

The problem: Boyce-Codd Normal Form Violation.  This is problem SR29 in the appendix.

The description: Given a set of attributes A, a collection F of functional dependencies, and a subset A’ of A, does A’ violate Boyce-Codd normal form?

G&J’s definition of BCNF is: A’ violates BCNF if we can find a subset X of A’ and two attributes y and z in A’-X such that the pair (X,{y}) is in the closure of F, but the pair (X,{z}) is not.

The paper by Beeri and Bernstein defines BCNF as: for all disjoint nonempty sets of attributes X and Y in A, if Y is functionally dependent on X, then all attributes in A depend on X.

Example: Here is an example from the Beeri and Bernstein paper that has the reduction:

Let A = {Apartment_Type, Address, Size, Landlord, Apt_Num, Rent, Name}

(I’m pretty sure “Apartment_Type” means something like “Two Bedroom”, “Size” is in square feet, “Address” is the street address of the apartment building, at which there are potentially many apartments each with a different “Apt_Num”)

With these functional dependencies:

  • ({Apartment_Type, Address}, {Size})
  • ({Address}, {Landlord})
  • ({Address, Apt_Num},{Rent})
  • ({Name}, {Address, Apt_Num})

If A’ is {Name, Address, Apt_Num, Rent}, then BCNF is violated because Rent is functionally dependent on Name (through the pair of attributes {Address, Apt_Num}), but other attributes (for example, Size) are not.
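To make the definition concrete, here is a small Python sketch (my own, not from G&J or the paper) that computes the closure of a set of attributes under F and uses it to check G&J’s BCNF-violation condition on the apartment example above:

```python
from itertools import chain, combinations

# Functional dependencies from the example (left side determines right side).
F = [({"Apartment_Type", "Address"}, {"Size"}),
     ({"Address"}, {"Landlord"}),
     ({"Address", "Apt_Num"}, {"Rent"}),
     ({"Name"}, {"Address", "Apt_Num"})]

def closure(attrs, F):
    """All attributes functionally determined by `attrs` under F."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in F:
            if left <= result and not right <= result:
                result |= right
                changed = True
    return result

def violates_bcnf(A_prime, F):
    """G&J's condition: some X inside A' determines one attribute of A'-X but not another."""
    proper_subsets = chain.from_iterable(
        combinations(A_prime, n) for n in range(1, len(A_prime)))
    for X in map(set, proper_subsets):
        determined = closure(X, F)
        rest = A_prime - X
        if any(y in determined for y in rest) and any(z not in determined for z in rest):
            return True
    return False

A_prime = {"Name", "Address", "Apt_Num", "Rent"}
# Prints True: for example, X = {Address, Apt_Num} determines Rent but not Name.
print(violates_bcnf(A_prime, F))
```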

Reduction: Beeri and Bernstein reduce from Hitting Set.  So we start with a set S, and a collection C of subsets of S, and an integer K.  We need to create a set of attributes and functional dependencies.

Each element in S will be an attribute in A.  If we have a hitting set S’ (a set of elements that intersects every set in C), then we can create a BCNF violation by making some of the attributes (but not all) of A functionally dependent on S’.

So we create a new attribute for each subset in C, and we add a dependency from each element attribute to the “set” attribute of every set in C that contains it.  Now if we have a hitting set S’, then every “set” attribute is functionally dependent on S’ (since S’ contains at least one element of each set).  So then if we make some, but not all, of the attributes in A dependent on the set of all element attributes, we get a BCNF violation.

That’s the basic idea, but the actual proof has a lot of special cases to worry about and resolve.
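Here is a rough Python sketch of just the core construction described above (one attribute per element of S, one per set in C, and a dependency from each element attribute to the set attributes of the sets containing it).  This is my own paraphrase of the basic idea only; it is not Beeri and Bernstein’s full reduction with all of its special cases:

```python
def hitting_set_to_bcnf_core(S, C):
    """Core of the construction: S is a set of elements, C a list of subsets of S."""
    # One attribute per element of S, plus one "set" attribute per subset in C.
    element_attr = {s: f"elem_{s}" for s in S}
    set_attr = {i: f"set_{i}" for i in range(len(C))}
    A = set(element_attr.values()) | set(set_attr.values())

    F = []
    # Each element determines the attribute of every set that contains it, so any
    # hitting set of C functionally determines all of the "set" attributes.
    for i, c in enumerate(C):
        for s in c:
            F.append(({element_attr[s]}, {set_attr[i]}))
    return A, F

# tiny example
A, F = hitting_set_to_bcnf_core({1, 2, 3, 4}, [{1, 2, 3}, {2, 3, 4}])
```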

Difficulty: 8.  The general idea isn’t too hard (though you have to learn the definition of BCNF to even get started), but there are a ton of details and special cases to worry about to do it right.

Prime Attribute Name

We’re skipping ahead a bit because this reduction is from the same paper as last week’s (and reduces from last week’s problem).

The problem: Prime Attribute Name.  This is problem SR28 in the appendix.

The description: Given a set A of attribute names, and a collection F of functional dependencies like in the last problem, and a specific attribute x in A, is x in some key K for the relational system <A,F>?  (Recall that keys need to be of minimal size).

Example: Using the example from last week- the list of attributes was {Name, ID, Course, Professor, Time}.  The functional dependencies were:

  • ({Name},{ID})  (If you know a student’s name, you can figure out their ID)
  • ({ID},{Name})  (If you know a student’s ID, you can figure out their name)
  • ({Name, Course}, {Professor, Time}) (If you know a student’s name and a course they are in, you can figure out the professor teaching it and what time the course is)
  • ({Professor, Time}, {Course}) (If you know the professor of the course and the time it is being taught, you can figure out the course being taught.  Note that you can’t figure out the student because more than one student will be in the course)

Recall that we said the pair of attributes {Name, Course} was a key for this system.  (So, if x was “Name” or “Course”, a solver for this problem should answer “yes”.)  But since Name and ID are related, ID is also a prime attribute.  I think Professor and Time are not prime, because any subset of A that includes them and has the required characteristic is not minimal (if you take Professor, you’d have to add two other things, for example Course and ID).

Reduction: This reduction is in the same paper that has the Minimum Cardinality Key reduction, and in fact reduces from Minimum Cardinality Key.  Their terminology differs from what I typically use, but I’ll try to do it their way to be close to the paper.

The MCK instance starts with a set of attributes A’, a set of functional dependencies D[0]’ (what G&J call F), and a key size bound m (what G&J call M).  We need to build an instance of the PAN problem: a set of attributes A, a new set of functional dependencies D[0], and a specific attribute we want to check: b (what our definition calls x).

First, they define a set A”, which is a set whose cardinality is the smaller of |A’| and m (consisting, I guess, of all new elements).  They define A to be A”xA’ unioned with A’ unioned with a new attribute b.  They define D[0] with the following characteristics (a code sketch of this construction appears after the list):

  • For all subsets E and F of A’, if E relates to F in D[0]’, then E unioned with b relates to F in D[0]
  • A’ unioned with b relates to A”xA’
  • for each ordered pair (i,e) in A”xA’, b unioned with (i,e) relates to e.
  • for each element i in A” and all distinct e,f in A’, the set {(i,e), (i,f)} relates to b.
  • Each element e in A’ relates to b.
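Here is a rough Python sketch (my own notation, reading the indices as above; treat it as illustrative only) that builds the PAN instance from an MCK instance:

```python
from itertools import product, combinations

def mck_to_pan(A_prime, D0_prime, m):
    """Build a PAN instance (A, D0, b) from an MCK instance (A', D[0]', m),
    following the construction described above (my reading of it)."""
    b = "b"
    A_dd = [f"i{j}" for j in range(min(len(A_prime), m))]   # A'': new elements
    pairs = set(product(A_dd, A_prime))                     # A'' x A'
    A = pairs | set(A_prime) | {b}

    D0 = []
    # (1) each original dependency (E, F), with b added to the left-hand side
    for E, F in D0_prime:
        D0.append((set(E) | {b}, set(F)))
    # (2) A' together with b determines all of A'' x A'
    D0.append((set(A_prime) | {b}, set(pairs)))
    # (3) b together with a pair (i, e) determines e
    for (i, e) in pairs:
        D0.append(({b, (i, e)}, {e}))
    # (4) two pairs that share a first coordinate determine b
    for i in A_dd:
        for e, f in combinations(A_prime, 2):
            D0.append(({(i, e), (i, f)}, {b}))
    # (5) each original attribute determines b
    for e in A_prime:
        D0.append(({e}, {b}))
    return A, D0, b

# tiny example: attributes {"x", "y"}, one dependency x -> y, bound m = 1
A, D0, b = mck_to_pan(["x", "y"], [({"x"}, {"y"})], 1)
```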

The idea now is that if D[0] has a minimal key that contains b, you need some number n of the (i,e) pairs in the key as well, to distinguish which ones relate to b.  Then the A’ parts of those ordered pairs give you n elements in A’ that are a key for A’.

If A’ has a key of cardinality n <= m, then we know that A” has at least n elements, and adding b to the corresponding pairs from A”xA’ gets you a key for A.

Also, b has to be in the key for A, since removing it gets you a set that is not D[0]-expansible.

Difficulty: 9.  This is really a lot to digest.  I’m not entirely sure I get what’s happening myself.

Minimum Cardinality Key

Now we enter the “Database Problems” subsection of the chapter.  I’ll admit that I’ve had trouble with a lot of these, partially because databases aren’t an area I know a ton about, but also because the definitions of the problems often obscure what’s really happening.

The problem: Minimum Cardinality Key.  This is problem SR26 in the appendix.

The description: Given a set A of attributes, a collection F of ordered pairs of subsets of A (“functional dependencies”), and a positive integer M, can we find a set K of M or fewer attributes that forms a key for the relational system <A,F>?

G&J defines “forms a key” to mean that the ordered pair (K,A) belongs to the closure F* of F, where F* has the following properties:

  • F ⊆ F*
  • B⊆C⊆A implies (C,B) ∈ F*  (the “projective” closure)
  • (B,C) and (C,D) ∈ F* implies (B,D) ∈ F*  (the “transitive” closure)
  • (B,C) and (B,D) ∈ F* implies (B,C∪D) ∈ F* (the “additive” closure)

Example: The paper by Lucchesi and Osborn that has the reduction has a good example:

Suppose we have a database table for student records.  The attribute set A could be {Name, ID, Course, Professor, Time}.  Some functional dependencies, written as ordered pairs, could be:

  • ({Name},{ID})  (If you know a student’s name, you can figure out their ID)
  • ({ID},{Name})  (If you know a student’s ID, you can figure out their name)
  • ({Name, Course}, {Professor, Time}) (If you know a student’s name and a course they are in, you can figure out the professor teaching it and what time the course is)
  • ({Professor, Time}, {Course}) (If you know the professor of the course and the time it is being taught, you can figure out the course being taught.  Note that you can’t figure out the student because more than one student will be in the course)

If these relations are F, then we can derive all of the attributes just from knowing the name of the student and the course they are taking.  In other words, F* contains the pair ({Name, Course}, {Name, ID, Course, Professor, Time}).

Reduction:  The Lucchesi and Osborn paper uses different terminology than G&J.  They define a relation D in a series of inductive steps, where D[0] is F, and each D[i] is created from D[i-1] by doing one step of one of the closure rules on the elements in D[i-1].  The final closure (what G&J call F*) they call D.

They also define an expansion of a relation D: Given a subset B of our attribute set A, if a relation (L,R) is in D, where L is contained in B but R isn’t, then the set B∪R is a D-expansion of B.

They first show that if B is a subset of A, and B’ is a D[0]-expansion of B, then (B,B’) and (B’,B) are in D.  They then show that a subset B of A is D[i]-expansible  (and thus D-expansible) if and only if it is D[0]-expansible.

Then, they do the reduction from VC. A will have one attribute for each vertex in G.  D[0] (G&J’s F) will, for each vertex v, relate the set of v’s neighbors to v.  They show that a subset K of A is a key if and only if it’s a vertex cover of G.  They do this by induction: if K=A, it’s obviously both a key and a cover.  If K is smaller than A, then K is a key if and only if it has a D[0]-expansion that is also a key, and K has a D[0]-expansion if and only if there is some vertex v not in K such that all of the vertices adjacent to v are in K (which is exactly the situation when K covers every edge incident to v without containing v).  So being a key for <A,D[0]> lines up with being a vertex cover of G.
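The construction itself is simple enough to write down.  Here is a small Python sketch (my own notation) that builds A and D[0] from a graph given as an edge list:

```python
def vertex_cover_to_mck(vertices, edges, k):
    """One attribute per vertex; for each vertex v, the set of v's neighbors
    functionally determines v.  The key-size bound M is just k."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    A = set(vertices)
    F = [(neighbors[v], {v}) for v in vertices]   # the paper's D[0], G&J's F
    return A, F, k

# tiny example: the path 1-2-3; {2} is both a vertex cover and a key of size 1
A, F, M = vertex_cover_to_mck([1, 2, 3], [(1, 2), (2, 3)], 1)
```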

Difficulty: 8. There must be an easier way to explain this problem.

Rectilinear Picture Compression

This problem is apparently one of the more famous “unpublished manuscript” problems.  The reference in G&J is an unpublished manuscript by William Masek.  David Johnson mentioned in one of his NP-Completeness columns that Masek had “disappeared from the theory community”, but Johnson had the manuscript, and put out a call to Masek for permission to publish it as an article.  I guess that he never resurfaced, since I can’t find anything. I did find a different paper that explains the reduction though, so we’ll use that.

The problem: Rectilinear Picture Compression.  This is problem SR25 in the appendix.

Description: Given an NxN matrix M, consisting of 0’s and 1’s, and a positive integer K, can we find K or fewer rectangles in M that cover precisely all of the 1’s in M?

Example: Suppose M was:

1 0 0 1 1
1 0 1 1 1
0 0 1 1 0
0 0 1 1 1
1 0 0 0 1

Here is what I think the best way to draw the rectangles is (one concrete cover is spelled out in the code sketch below):

Notice that 1×1 rectangles are allowed, and by the way I read the problem definition you can have a cell that is in multiple rectangles.  (What you can’t have is a 0 inside a rectangle, or a 1 that is outside every rectangle.)
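Here is a small Python check of one cover of the example matrix that uses 5 rectangles (I am not claiming 5 is optimal); each rectangle is written as (top row, left column, bottom row, right column), all inclusive:

```python
M = [[1, 0, 0, 1, 1],
     [1, 0, 1, 1, 1],
     [0, 0, 1, 1, 0],
     [0, 0, 1, 1, 1],
     [1, 0, 0, 0, 1]]

# One possible cover: (top, left, bottom, right), all bounds inclusive.
rectangles = [(0, 0, 1, 0),   # the two 1's at the top of the first column
              (0, 3, 1, 4),   # the 2x2 block in the upper right
              (1, 2, 3, 3),   # the 3x2 block in the middle
              (3, 4, 4, 4),   # the two 1's at the bottom of the last column
              (4, 0, 4, 0)]   # the lone 1 in the bottom-left corner

covered = set()
for top, left, bottom, right in rectangles:
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            assert M[r][c] == 1        # no 0 may be inside a rectangle
            covered.add((r, c))

ones = {(r, c) for r in range(5) for c in range(5) if M[r][c] == 1}
assert covered == ones                 # every 1 is inside some rectangle
print(len(rectangles))                 # 5
```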

Reduction: The good news about Johnson’s NP-Completeness column is that it led me to a technical report by Conn and O’Rourke that explains most of the details of Masek’s reduction.  Which is a good thing, because it’s very complicated.  He basically defines shapes of 1’s as “wires” running through the matrix (the Conn and O’Rourke paper has pictures of the various wire shapes).

These wires will be used to represent boolean expressions (truth values are expressed as rectangles of 1’s: True is a 2×1 rectangle, False is a 1×2).  Masek then proves several lemmas about how many rectangles it takes to cover each figure, and gives an algorithm to show how to build a figure out of the wire components for any boolean expression.  From the “how many rectangles does it take to cover each figure” lemmas, we know how many rectangles it will take to cover the whole figure if the expression is satisfiable, and that count becomes the K for the problem.

Difficulty: 9, maybe 10.  The Conn and O’Rourke paper doesn’t actually show the proofs of any of these lemmas, but I believe I could follow the reduction if I had access to the lemmas and the construction algorithm.

Regular Expression Substitution

This week’s problem is one of those “private communication” problems, so I had to create a reduction myself. I’m not confident it’s right, so let’s find out!

The problem: Regular Expression Substitution.  This is problem SR24 in the appendix.

The description:  Given two finite alphabets X and Y (possibly with different numbers of symbols, possibly with some overlap between the symbols), a regular expression R over X∪Y, one regular expression Ri over Y for each symbol xi in X, and a string w in Y*.  Can we find a string z in the language of R, and a string wi in each of the Ri languages, such that if we start with z and replace each symbol xi with the string wi, the resulting string is w?

Example:  Here’s a pretty simple example.  Obviously you can use much more complicated regular expressions to make much more complicated things.

X = {a,b,c}, Y = {1,2,3,4}, w = 1231122334

R = X*Y (0 or more symbols from X, ending with a symbol from Y)

R1 = 1 + 11

R2 = 2 + 22

R3 = 3 + 33

z can be abcabc4

Each occurrence of a will be replaced by a string from R1 (1 the first time, 11 the second time), each occurrence of b will be replaced by a string from R2, and each occurrence of c will be replaced by a string from R3, getting us w back.
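Here is a quick Python check of this example, using Python’s re syntax for the regular expressions and following the example’s interpretation that each occurrence of a symbol of X may be replaced by a different string from its language:

```python
import re

X, Y = "abc", "1234"
w = "1231122334"

R = re.compile(rf"[{X}]*[{Y}]")          # R = X*Y
R_sub = {"a": re.compile(r"1|11"),        # R1 = 1 + 11
         "b": re.compile(r"2|22"),        # R2 = 2 + 22
         "c": re.compile(r"3|33")}        # R3 = 3 + 33

z = "abcabc4"
# One choice of replacement string for each occurrence of a symbol of X, in order.
replacements = iter(["1", "2", "3", "11", "22", "33"])

assert R.fullmatch(z)                     # z is in the language of R
out = []
for ch in z:
    if ch in R_sub:
        wi = next(replacements)
        assert R_sub[ch].fullmatch(wi)    # each w_i is in the language of its R_i
        out.append(wi)
    else:
        out.append(ch)                    # symbols of Y are left alone
assert "".join(out) == w                  # the substituted string is exactly w
```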

Reduction: G&J say to use X3C.  Here’s what I think I want to do:

We’re given an instance of X3C, so a set S (normally it’s X, but I don’t want to confuse it with the X in this problem) with 3q elements, and a collection C of 3-element subsets of S.  Our alphabet X will have one symbol for each element of each set in C (so an element will be something like ci,j for element i of set j).  Our alphabet Y will have one symbol for each element in S.

w will be the string s1s2..s3q (all the elements of S, in some arbitrary order).

R will be the regular expression describing “strings of length 3q over X that, for each set in C, either don’t use that set’s symbols at all or use all three of its symbols exactly once”.  This is the part I’m not confident in.  I’m sure this describes a regular language.  What I’m not sure of is whether the expression can be written in a form that is polynomial in the size of S and C.  I don’t know enough about how to minimize regular expressions to be able to answer that.

Anyway, from there, each expression Rci,j generates the single symbol of Y corresponding to that element (so the expression for the symbol c1,5 of X would be whatever the first element of set 5 in C is).

The idea is that you generate w by finding a string z that uses the correct elements of the correct sets.  The rules for how you can build z constrain the chosen sets so that they form a legal X3C solution.

Difficulty: 9.  Maybe it should be a 10, since I’m not really confident this is correct.  The real trouble I had coming up with a reduction was that strings in regular expressions have a fixed order, but the X3C problem just wants sets, which have no ordering.  So you need a way for the single string w to stand for any arrangement of the symbols in S.  And while “uses each symbol in Y exactly once” describes a regular language, I’m not sure the expression can be written down in polynomial time relative to the size of S and C either.  The only way I can think to do it offhand is by choosing between all (3q)! permutations of the symbols.

Internal Macro Data Compression

I’m finally back from my trip.  Apologies for the delay.

The problem: Internal Macro Data Compression.  This is problem SR23 in the appendix.

The description: Given an alphabet Σ, a string s in Σ*, a “pointer cost” h and a bound B, can we find a string C over the alphabet Σ augmented with at most |s| “pointer characters” pi such that:

  • |C|+ (h-1)* (# of pointer characters in C) is ≤ B, and
  • We can replace pointer characters with substrings of C to regain s?

Example: This is similar to the last macro compression problem, but instead of having 2 strings C and D, C serves both purposes.  In our example from last time, we had s = “ABCDEFABCGABCDEF”, and came up with C = “pqpGpq” and D = “ABCDEF”.  The difference here is that there is no D string; we have to do it all in C.  We can do that by letting C = “ABCDEFpGpq”.  Now if we let p = “ABC” (or, “the substring of C of length 3 starting at position 1”) and q = “DEF” (or, “the substring of C of length 3 starting at position 4”), we get a string of cost 10 + 3(h-1) that gives us back s if we replace all pointers with the strings they represent.
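Here is a quick Python sanity check of this example, assuming (purely for illustration) the convention that each pointer names a (start, length) substring of C and that pointers are expanded in place:

```python
s = "ABCDEFABCGABCDEF"
C = list("ABCDEF") + ["p", "G", "p", "q"]   # the compressed string "ABCDEFpGpq"
pointers = {"p": (0, 3),                     # p -> C[0:3] == "ABC"
            "q": (3, 3)}                     # q -> C[3:6] == "DEF"

def expand(C, pointers):
    """Replace every pointer character with the substring of C it refers to."""
    out = []
    for ch in C:
        if ch in pointers:
            start, length = pointers[ch]
            out.append("".join(C[start:start + length]))
        else:
            out.append(ch)
    return "".join(out)

h = 2   # an example pointer cost
cost = len(C) + (h - 1) * sum(1 for ch in C if ch in pointers)
assert expand(C, pointers) == s
print(cost)   # 10 + 3*(h-1) = 13 when h = 2
```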

Reduction: The paper by Storer that we’ve been using for the last few problems also has this reduction.  I think this is his “CPM” problem.  He uses the “K-Vertex Cover” problem, in the case where K=1.   Recall in that case we are looking for a set of edges E’ such that each edge in E shares a vertex with an edge in E’. We start with a graph, which is an instance of this problem.  The alphabet we build will have:

  • A special symbol $
  • One symbol for each vertex and edge in G

For each edge ei =  (vj, vk), define the substring Ei = $vj$vk$

Our s = the concatenation of eiEi for all edges i (so the symbol for the edge followed by the substring for the edge).  B = J + |s| - |E|, where J is the bound from the 1-VC instance.

The idea (and I’ll admit I have a hard time following this paper) is that if G has a 1-VC of size J, then we have J edges that form a set E’ such that each edge in E shares a vertex with an edge in E’.  This means that the Ei strings of edges that share vertices overlap, so we can make pointers out of those overlapping pieces and save enough characters.

Difficulty: 9.  I’m sure this can be explained better, but I really have a hard time following it.

External Macro Data Compression

These next problems are related “macro compression” problems.

The problem: External Macro Data Compression.  This is problem SR22 in the appendix.

The description: Given a string s over some alphabet Σ, a “pointer cost” h and a bound B, can we find two strings D and C over the alphabet of Σ augmented with some number (< |s|) of “pointer characters” pi such that:

  • |D| + |C| + (h-1)* (# of p characters in D and C) ≤ B
  • We can generate s by replacing pointer characters in C with their “meaning” in D.

Example: The concept they’re trying to get at isn’t that hard, it’s just hard to explain using mathematical language.  Suppose s = “ABCDEFABCGABCDEF”

Then we can define D to be “ABCDEF” and C to be “pqpGpq”.  The total size of this is 6 for D, plus 6 for C.  There are 5 pointer characters in C, so our total cost is 6 + 6 + 5(h-1).

The idea is that the characters p and q are “pointers” that refer to substrings in D (p refers to “ABC” and q refers to “DEF”).  By replacing those pointers with what they “mean” in C, we can get back s.

A tricky part of this is that you are allowed to have substrings overlap.  So if s was “ABCDBCD” we could define D to be “ABCD” and C to be “pq” with p meaning “ABCD” and q meaning “BCD”.  Now our total cost is 4 for D, 2 for C, and 2 pointers, so 4 + 2 + 2(h-1).

Reduction:  The reduction (and the one for SR23) comes from a technical report by Storer, which is pretty dense.  I think we’re looking at “Theorem 2” in the paper, which reduces from VC<=3.

The alphabet that will be built from the VC instance has a lot of parts:

  • A special symbol $
  • 3 symbols vi, ai, and bi for each vertex vi in the graph
  • 4 more symbols fi,1,1 through fi,2,2 for each vertex vi in the graph
  • one symbol di for each edge ei in the graph.
  • 2 symbols c1 and c2 (there is actually one c symbol for each value of h.  We’ll assume h=2 here)
  • 3 symbols g1,1 g1,2 and g2,1 (This is also based on h.  So really you’d go from g1,1 through gh-1,2 and add gh,1)

The string s will also be built from parts (a lot of these are based on h as well; again, I’m fixing h=2 to keep it simpler):

  • Vi,1 = ai$vi
  • Vi,2 = vi$bi
  • For each edge ei= (vj, vk), Ei = $vj$vk$
  • We also have Z1 = (c1)3 (so 3 copies of c1) and Z2 = (c2)3

s is:

  • Eidi  concatenated for each edge, followed by
  • Vi,jfi,j,k concatenated over each vertex and each possible f symbol, followed by
  • Z1g1,1Z1g1,2, followed by
  • Z2g2,1Z2

K’ = |s| + K – (7/2)|V|.
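To make the construction concrete, here is a rough Python sketch that builds the string s (as a list of symbols) and the bound K’ for the h=2 case.  The symbol-naming scheme is my own invention and purely illustrative, not Storer’s notation:

```python
def build_s(vertices, edges, K):
    """Assemble s and K' for h = 2, following the description above."""
    def sym(*parts):                                  # one "symbol" per name
        return "_".join(str(p) for p in parts)

    s = []
    # E_i d_i for each edge e_i = (v_j, v_k), where E_i = $ v_j $ v_k $
    for i, (vj, vk) in enumerate(edges):
        s += ["$", sym("v", vj), "$", sym("v", vk), "$", sym("d", i)]
    # V_{i,j} f_{i,j,k} for each vertex, each j in {1,2}, and each k in {1,2}
    for v in vertices:
        V = {1: [sym("a", v), "$", sym("v", v)],      # V_{i,1} = a_i $ v_i
             2: [sym("v", v), "$", sym("b", v)]}      # V_{i,2} = v_i $ b_i
        for j in (1, 2):
            for k in (1, 2):
                s += V[j] + [sym("f", v, j, k)]
    # Z_1 g_{1,1} Z_1 g_{1,2}, followed by Z_2 g_{2,1} Z_2
    Z1, Z2 = [sym("c", 1)] * 3, [sym("c", 2)] * 3
    s += Z1 + [sym("g", 1, 1)] + Z1 + [sym("g", 1, 2)]
    s += Z2 + [sym("g", 2, 1)] + Z2

    K_prime = len(s) + K - (7 * len(vertices)) // 2   # K' = |s| + K - (7/2)|V|
    return s, K_prime

# tiny example: a triangle, with a vertex cover bound of K = 2
s, K_prime = build_s([0, 1, 2], [(0, 1), (1, 2), (0, 2)], K=2)
```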

The basic idea from here is that if G has a VC, we can compress s by making pointers for (among other things) the combination of Vi,1fi,1,2Vi,2 and the combination of Vi,2fi,2,2Vi+1,1 where vertex vi is in the VC.  This lets us use the pointers not only for the part of the string with the V’s and f’s, but also for the $vi$ strings in the E components, saving space.

In the other direction, if we have a compressed string of size K’, he shows that you must have compressed a string like the above in essentially this way, and the overlaps show you the vertices in the cover.

Difficulty: 8.  I think to really get what’s happening, you need to actually build an example on a graph.  But I think the idea of building the strings so that the overlaps are exactly what you’re looking for is something that can be gotten eventually.
