The problem: Linear Bounded Automaton Acceptance. This is problem AL3 in the appendix.
The description: Given a Linear Bounded Automaton L, and a string x, does L accept x?
Example: A well-known context-sensitive language is L = {a^{n}b^{n}c^{n}}. If we had an LBA that accepts this language and gave it a string such as aabbccc, the LBA should say “no” on this instance. But if our string was aabbcc, the LBA would say “yes”.
Reduction: G&J say this is a “Generic reduction”, and I can see why. Let me try to add some explanation and put it into something more like what we’re used to:
We start with a known NP-Complete (not NP-Hard) problem. Let’s say 3SAT. 3SAT is in NP, which means we have a non-deterministic Turing Machine that takes an instance x and solves 3SAT in polynomial time, let’s say p(|x|). Let’s assume that the alphabet of this Turing Machine is {0,1}.
Given this instance x of 3SAT, we build the following context-sensitive language (and it’s implied that we can go from there to an LBA in polynomial time): {#^{p(log(|x|))} x #^{p(log(|x|))}} over the alphabet {0,1,#}. So we have a number of # symbols before and after x equal to the polynomial applied to the log of the input length.
To be honest, I’m not sure why you need the log here. I think you can get away with just having p(|x|) symbols on each side, and the total length of the LBA acceptance instance is still polynomial in |x|. The idea is that since we know that the TM completes in time p(|x|), it can only ever move its head p(|x|) tape cells to the left or right before it runs out of time. So we can use that as the “bound” on the LBA.
So, if our LBA uses the exact same states and transitions as the non-deterministic Turing Machine that solved 3SAT, we now have our LBA accept x exactly when x was satisfiable.
The reason this is a “generic” reduction is that nothing we did had anything to do with 3SAT specifically. We could do this process for any problem in NP. It’s more useful if we start with an NP-Complete problem, but we could do this for things in P as well since they are also in NP.
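As a concrete (if trivial) illustration, here is a Python sketch of building the padded instance, using p(|x|) symbols on each side as discussed above. The bound p(n) = n² in the usage line is an assumed placeholder, not anything from G&J:

```python
def lba_instance(x: str, p) -> str:
    # Pad x with p(|x|) marker symbols on each side.  In p(|x|) steps
    # the simulated NTM can never move its head past this padding,
    # so the LBA's linear bound is never actually a constraint.
    pad = "#" * p(len(x))
    return pad + x + pad

# Hypothetical time bound p(n) = n^2, purely for illustration.
inst = lba_instance("1011", lambda n: n * n)
```

The LBA then just runs the NTM's own states and transitions on the middle portion of the tape.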
Difficulty: 5, mainly because this is a weird way of doing things, and Linear Bounded Automata are things that often get short shrift in Theory of Computation classes.
The problem: Consistency of Database Frequency Tables. This is problem SR35 in the appendix.
The description: I find G&J’s definition confusing, so this definition borrows a lot from the paper by Reiss that has the reduction.
We have a set of “characteristics” (attributes) A. Each attribute a in A has a domain D_{a}. A database is a set of “Objects” (tuples) O_{1}..O_{n} where each object defines a value for each characteristic. (The result is a two-dimensional table where rows are objects and columns are attributes). We can define a frequency table for each pair of attributes a and b in A. The table has |D_{a}| rows and |D_{b}| columns, and each entry (x,y) is “supposed” to represent the number of tuples in O that have x for their a attribute and y for their b attribute.
What the problem is asking is: Given a set of tables and a database table V, can we find a way to map the attributes in A to the tables such that the tables actually represent the frequencies in the database?
Example:
Since you need a frequency table for each pair of attributes, here is an example with 3 attributes, each taking 2 possible values. Attribute a’s domain is {0,1}, b’s is {a,b}, and c’s is {>, <}. Our set of objects is:
If we are handed a set of 3 frequency tables:
C_{1} vs C_{2}:

| 1 | 0 |
| 2 | 1 |

C_{1} vs C_{3}:

| 0 | 1 |
| 1 | 2 |

C_{2} vs C_{3}:

| 1 | 2 |
| 0 | 1 |
These frequency tables are accurate if C_{1} is attribute a, C_{2} is attribute b, and C_{3} is attribute c.
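To make the frequency-table definition concrete, here is a small Python sketch that checks consistency by brute force over all mappings of table columns to attributes. The two-attribute example database at the bottom is hypothetical, just for illustration:

```python
from collections import Counter
from itertools import permutations

def freq_table(objects, a, b):
    # Entry (x, y) counts the tuples with value x for attribute a
    # and value y for attribute b.
    return Counter((o[a], o[b]) for o in objects)

def consistent(objects, attrs, tables):
    # tables maps a pair of column indices to its claimed Counter.
    # Try every assignment of column indices to real attributes.
    for perm in permutations(attrs):
        label = dict(enumerate(perm))
        if all(freq_table(objects, label[i], label[j]) == t
               for (i, j), t in tables.items()):
            return True
    return False

# Tiny made-up database with attributes a and b.
objects = [{"a": 0, "b": "x"}, {"a": 0, "b": "x"}, {"a": 1, "b": "y"}]
tables = {(0, 1): Counter({(0, "x"): 2, (1, "y"): 1})}
```

The brute force over mappings is exactly the hard part of the problem: the number of permutations grows factorially with the number of attributes.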
The reduction: From 3SAT. (Actually, this might be from regular CNF-SAT, but there is no reason not to restrict the input to 3 literals per clause). We take a SAT instance with p clauses and q variables and build a database with 2 objects per variable (one for positive, one for negative) and p*q extra objects. We have one attribute per clause (“CL_{i}“), one attribute per variable (“VR_{i}“), and “VALUE” (which holds the truth value of the object in the final setting). The frequency tables are set up to ensure that each variable has one value setting (true or false) and each clause is made true. The way to make the database consistent with the tables is to find a way to map the variables in the database to the tables in a way that makes the formula satisfiable.
Difficulty: 5, because the reduction isn’t that hard to follow, or come up with, once you get how the frequency tables work. But unlike most 5’s, I don’t think I’d assign it as a homework problem, because of how much extra work it would take to explain the frequency tables in the first place.
The problem: Safety of Database Transaction Systems. This is problem SR34 in the appendix.
The description: Given a set of variables V and transactions T as defined in the Serializability of Database Histories problem, is every history H for T equivalent to some serial history?
Example: This is an extension of last week’s problem. So last week’s example, which produced a history that is not serializable, means that that set of transactions is not safe.
The easiest way to produce transactions that are safe is to make them access different variables: Suppose I have 2 transactions (R_{1}, W_{1}) and (R_{2}, W_{2}). If the first transaction reads and writes a variable x and the second transaction reads and writes a variable y, then any ordering of those transactions will be serializable.
Reduction: It’s in the same paper by Papadimitriou, Bernstein, and Rothnie as the Serializability of Database Histories problem. It’s interesting that they couldn’t show that the problem was in NP.
They reduce from Hitting Set. They show in the paper how to take a transaction system and build a graph where there is one vertex for each read and write operation in each transaction, and edges between the two operations in a transaction. There are also edges between operations R_{i} and W_{j} or W_{i} and W_{j} if those operations share a variable. These edges show places where changing the order changes the meaning of a transaction history. They show that a transaction system is safe if and only if the graph has no cycles containing a (R_{j}, W_{j}) edge. (Note that means that cycles can exist as long as they contain only W-W edges)
So, given an instance of hitting set (a set S and a collection C of subsets of S), they build a transaction graph: one read vertex for each set in C, plus one write vertex at the end. Between read vertices R_{i} and R_{i+1} we add |C_{i}| edges (or, so we still have a simple graph, |C_{i}| paths containing vertices that don’t appear anyplace else). At the end of this chain of paths is a single W vertex, with an edge back to R_{1}. The only unsafe cycle now starts at R_{1}, goes through one of the paths connecting each R vertex, goes to the final W vertex, and then back to R_{1}.
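The safety test itself is easy to sketch: a cycle through an (R_j, W_j) edge exists exactly when there is a directed path from W_j back to R_j. A minimal Python version (the vertex names and adjacency-dict encoding are my own assumptions):

```python
from collections import deque

def reachable(adj, src, dst):
    # Breadth-first search for a directed path src -> dst.
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

def unsafe(adj, rw_edges):
    # A cycle through the edge (R_j, W_j) exists exactly when there
    # is a path from W_j back to R_j; W-W-only cycles are ignored.
    return any(reachable(adj, w, r) for r, w in rw_edges)
```

This only checks a given graph for an unsafe cycle; the hard part of the paper is building the graph so that such a cycle corresponds to a hitting set.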
So far, so good. But then they lose me when they say “We can embed the hitting set problem –among others– in safety by forcing (by the use of sets of singletons) each such choice to correspond to a hitting set”. I think what they’re saying is that they will create a set of variables corresponding to sets in C such that an unsafe cycle exists if and only if S has a hitting set. But I’m not sure how they get there, especially in polynomial time. I’m sure there’s a way, but it reads like “we set it up so that it all works”, which isn’t convincing to me.
Difficulty: 9, because I don’t see how they do that last step. I’m sure a good explanation exists that would make this less difficult. I’ll also note that the paper says the transaction system is unsafe “if and only if there exists an unsafe path–and therefore a hitting set”. Which sounds like a co-NP proof to me. I’m probably missing something.
The problem: Serializability of Database Histories. This is problem SR33 in the appendix.
The description: We have a set V of variables in our database, and a set T of transactions, where each transaction i has a read operation (R_{i}) that reads some subset of V, and a write operation W_{i} that writes some (possibly different) subset of V. We’re also given a “history” H of T, which permutes the order of all reads and writes maintaining the property that for all i, R_{i} comes before W_{i} in the history. Think of this as a set of parallel transactions that reach a central database. H is the order the database processes these operations.
Can we find a serial history H’ of T, with the following properties:
Example: The paper by Papadimitriou, Bernstein, and Rothnie that has the reduction has a good simple example of a non-serializable history:
H= <R_{1}, R_{2}, W_{2}, W_{1}>, where R_{1} and W_{2} access a variable x, and R_{2} and W_{1} access a variable y. Both transactions are live since they both write their variables for the last time. Notice that neither transaction reads any variable. But the two possible candidates for H’ are: <R_{1}, W_{1}, R_{2}, W_{2}> (where R_{2} reads the y written by W_{1}) and <R_{2}, W_{2}, R_{1}, W_{1}> (where R_{1} reads the x written by W_{2}), so neither H’ candidate has the same set of transactions reading variables from each other.
Reduction: Is from Non-Circular Satisfiability. Given a formula, they generate a “polygraph” of a database history. A polygraph (N,A,B) is a directed graph (N,A) along with a set B of “bipaths” (paths that are 2 edges long). If a bipath {(v,u), (u,w)} is in B, then the edge (w,v) is in A. So, if a bipath exists in B from v to w, then an edge in A exists from w back to v. This means that we can view a polygraph (N,A,B) as a family of directed graphs. Each directed graph in the family has the same vertices and an edge set A’ that is a superset of A and contains at least one edge in each bipath in B. They define an acyclic polygraph as a polygraph (represented as a family of directed graphs) where at least one directed graph in the family is acyclic.
In the paper, they relate database histories to polygraphs by letting the vertex set N be the set of live transactions. We build an edge (u,v) in A from a transaction that writes a variable (vertex u) to a transaction that reads the same variable (vertex v). If some other vertex w also has that variable in its read set then the bipath {(v,w), (w,u)} exists in B. So edges (u,v) in A mean that u “happens before” v since u writes a variable that v reads. A bipath {(v,w), (w,u)} means that w also reads the same variable, so must happen before u or after v. They show that a history is serializable if and only if the polygraph for the history is acyclic.
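Checking a polygraph for acyclicity by brute force is straightforward to sketch (though exponential in the number of bipaths). In this hypothetical Python encoding, each bipath is given as its pair of candidate edges; it suffices to try the minimal edge choices, since adding more edges can only create cycles:

```python
from itertools import product

def is_acyclic(vertices, edges):
    # Kahn's algorithm: the graph is acyclic iff every vertex can be
    # peeled off in topological order.
    indeg = {v: 0 for v in vertices}
    for _, v in edges:
        indeg[v] += 1
    stack = [v for v in vertices if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    return removed == len(vertices)

def polygraph_acyclic(vertices, A, bipaths):
    # Every digraph in the family contains A plus at least one edge
    # of each bipath; trying exactly one edge per bipath is enough.
    for choice in product(*bipaths):
        if is_acyclic(vertices, list(A) + list(choice)):
            return True
    return False
```

The exponential loop over bipath choices is the intuition for why serializability testing lands in NP: the certificate is the choice of edges that makes the digraph acyclic.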
So, given a formula, they build a polygraph that is acyclic if and only if the formula is satisfiable. The polygraph will have 3 vertices (a_{j}, b_{j}, and c_{j}) for each variable x_{j} in the formula. Each a vertex connects by an edge in A to its corresponding b vertex. We also have a bipath in B from the b vertex through the corresponding c vertex back to the a vertex.
Each literal C_{ik} (literal #k of clause i) generates two vertices y_{ik} and z_{ik}. We add edges in A from each y_{ik} to z_{i(k+1)mod 3} (in other words, the y vertex of each clause connects to the “next” z vertex, wrapping around if necessary). If literal C_{ik} is a positive occurrence of variable X_{j}, we add edges (c_{j}, y_{ik}) and (b_{j}, z_{ik}) to A, and the bipath {(z_{ik}, y_{ik}), (y_{ik}, b_{j})} to B. If the literal is negative, we instead add (z_{ik}, c_{j}) to A and {(a_{j}, z_{ik}), (z_{ik}, y_{ik})} to B.
If the polygraph is acyclic (and thus the history it represents is serializable), then there is some acyclic digraph in the family of directed graphs related to the polygraph. So the bipath {(b_{j}, c_{j}), (c_{j}, a_{j})} will have either the first edge from b-c (which we will represent as “false”) or will have the second edge from c-a (which we will represent as “true”). (The directed graph can’t have both because its edges are a superset of A, which means it has the edge (a_{j}, b_{j}), and taking both halves of the bipath would cause a cycle).
If our acyclic directed graph has the “false” (b-c) version of the edge for a literal, then it also has to have the z-y edge of the bipath associated with the literal (otherwise there is a cycle). If all of the literals in a clause were set to false, this would cause a cycle between these bipath edges and the y-z edges we added in A for each clause. So at least one literal per clause must be true, which gives us a way to satisfy the formula.
If the formula is satisfiable, then build the acyclic digraph that starts with all of A, and takes the bipath edges corresponding to the truth value of each variable, as defined above. This determines which edges you need to take from the bipaths for the literals, to avoid cycles. The only way now for the graph to have a cycle is for there to be a cycle of y’s and z’s among the edges and bipath edges. But that would imply that we’ve set all of the literals in a clause to false. Since we know that the clause can be made true (since the original formula is satisfiable), we know that a way exists to make the directed graph acyclic.
Difficulty: 7. It takes a lot of baggage to get to the actual reduction here, but once you do, I think it’s pretty easy and cool to see how the cycles arise from the definitions of the graphs and from the formula.
The problem: Non-Circular Satisfiability. This problem is not in the appendix.
The description: A CNF-Satisfiability clause is mixed if it contains both variables and their negations. A Satisfiability formula is non-circular if each variable occurs in a mixed clause at most once. Given a non-circular satisfiability formula, is it satisfiable?
Example: This is actually a more general version of Monotone Satisfiability: in Monotone Sat, no mixed clauses can exist. So all Monotone formulas are also Non-Circular.
Notice that we’re not necessarily restricting things to 3SAT clauses: we can have more or fewer than 3 literals per clause. So here is a (satisfiable) instance:
F = (x_{1} ∨ x_{2}) ∧ (~x_{1} ∨ ~x_{2}) ∧ (~x_{1} ∨ x_{2})
This formula has just one mixed clause (the third one) and is satisfiable if x_{1} is false and x_{2} is true. The formula becomes unsatisfiable if we add the clause (x_{1} ∨ ~x_{2}), but then we would have both variables be in a second mixed clause.
Reduction: The easy way is to just do this by restriction from Monotone Sat (given a monotone sat formula, it’s automatically non-circular, so we’re done). But if we want an actual reduction, we can use the paper by Papadimitriou, Bernstein, and Rothnie, which has the next two reductions in G&J and uses this problem for the first one. They also show the reduction for this problem in their paper, reducing from regular CNF-SAT. So we have a formula F. For each variable x in F, suppose x appears m times in F. Create m new variables x_{1} through x_{m}. The first instance of x in F will be replaced by x_{1}, the second by ~x_{2}, the third by x_{3}, and so on down. We then need to basically add the rules x_{1} ≡ ~x_{2} ≡ x_{3} … (forcing each new literal we replaced x with to all have the same truth value). In CNF form, that’s equivalent to the clauses: (x_{1} ∨ x_{2}) ∧ (~x_{2} ∨ ~x_{3}) ∧ (x_{3} ∨ x_{4}) … and so on down. This means our new formula is equivalent to F (and is satisfiable exactly when F is). Each of these new clauses is non-mixed. The only other occurrence of each x_{i} variable is the one time it replaces x in the original F, which may or may not be a mixed clause, but either way, that means each variable appears in at most one mixed clause.
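The variable-splitting construction is mechanical enough to sketch in Python. This version uses DIMACS-style signed integers for literals and implements exactly the alternating pattern described; the encoding is my own choice:

```python
def make_noncircular(clauses):
    # Split each variable into one fresh variable per occurrence,
    # alternating polarity (x1, ~x2, x3, ...), then add the linking
    # clauses (x1 v x2), (~x2 v ~x3), (x3 v x4), ... which are all
    # non-mixed.  Literals are DIMACS-style signed integers.
    fresh = 0
    copies = {}          # original variable -> its fresh variables
    new_clauses = []
    for clause in clauses:
        new_clause = []
        for lit in clause:
            v = abs(lit)
            fresh += 1
            copies.setdefault(v, []).append(fresh)
            k = len(copies[v])                    # occurrence number
            rep = fresh if k % 2 == 1 else -fresh  # alternate polarity
            new_clause.append(rep if lit > 0 else -rep)
        new_clauses.append(new_clause)
    for vs in copies.values():
        for k in range(len(vs) - 1):
            if k % 2 == 0:
                new_clauses.append([vs[k], vs[k + 1]])
            else:
                new_clauses.append([-vs[k], -vs[k + 1]])
    return new_clauses
```

Each fresh variable appears once in a (possibly mixed) translated clause and otherwise only in the non-mixed linking clauses, so the output is non-circular.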
Difficulty: 3. This is about as hard as the Monotone Sat reduction.
The problem: Tableau Equivalence. This is problem SR32 in the appendix.
Definitions: We are given a set of attributes A and a set F of ordered pairs of subsets of A (functional dependencies). Most database queries ask for certain attributes (these are the “distinguished variables” in the G&J definition) that fill requirements defined by other attributes and values (anything new added is an “undistinguished variable” in the G&J definition).
A Tableau is a matrix that represents these attributes and variables. The columns correspond to attributes, and the rows correspond to tuples that are returned by the query. The first row of the tableau is the “summary” of the tableau and holds the distinguished variables we want to return (and possibly some constants)
Tableau example: This example comes from p. 223 of the paper. For the query:
Find all values for variables a_{1} and a_{2} such that we can find values for variables b_{1} through b_{4} such that the following strings are all in our database instance (called “I” in the paper). By convention in the paper, a variables are distinguished variables, and b variables are undistinguished.
If I were the set of strings {111, 222, 121}, then all assignments of 1’s and 2’s to a_{1} and a_{2} work:
Before I get to the 2 harder cases, let me show the tableau:
| A | B | C |
|---|---|---|
| a_{1} | a_{2} | |
| a_{1} | b_{1} | b_{3} |
| b_{2} | a_{2} | 1 |
| b_{2} | b_{1} | b_{4} |
The first row lists the attributes we’re considering (a_{1} comes from A and only occurs in the first spot in the result string; a_{2} comes from B and only occurs in the second spot in the result string; our query doesn’t want any variables from C). The summary lists the variables from the attributes that form our distinguished variables.
The rows below the summary show how to build legal strings (like my list above).
So now we can use the tableau to help see how to find legal values for the variables:
The problem: Given two Tableaux T_{1} and T_{2} which share the same A, F, X, and Y sets (as defined above), are they equivalent? That is, do they generate the same sets of legal values for their distinguished variables for all possible I sets?
Example: This gets tricky, partially because of the need to worry about “all possible I sets”, and partially because adding functional dependencies makes things equivalent in subtle ways. Here is the example from page 230 of the paper:
| A | B | C | D |
|---|---|---|---|
| a_{1} | a_{2} | a_{3} | a_{4} |
| a_{1} | b_{1} | b_{2} | b_{3} |
| b_{4} | b_{1} | a_{3} | b_{5} |
| a_{1} | a_{2} | b_{6} | b_{7} |
| a_{1} | b_{8} | b_{9} | a_{4} |
Suppose the functional dependencies B->A and A->C hold. Then all strings with b_{1} in the second character must have the same value (a_{1}) in the first character. Similarly, A->C implies that the entire third column can be replaced by a_{3}. This gives us the equivalent tableau:
| A | B | C | D |
|---|---|---|---|
| a_{1} | a_{2} | a_{3} | a_{4} |
| a_{1} | b_{1} | a_{3} | b_{3} |
| a_{1} | b_{1} | a_{3} | b_{5} |
| a_{1} | a_{2} | a_{3} | b_{7} |
| a_{1} | b_{8} | a_{3} | a_{4} |
And also, if the only difference between two rows is that one row has nondistinguished variables that don’t appear anywhere else, that row can be eliminated. So we can get rid of the first row (with b_{3}) because it is just like the second (with b_{5}). But then we realize that the a_{1}b_{1}a_{3}b_{5} row only differs from the row below it in b_{1} and b_{5}, which now only appear in that row. So we can remove that row too, to get the equivalent:
| A | B | C | D |
|---|---|---|---|
| a_{1} | a_{2} | a_{3} | a_{4} |
| a_{1} | a_{2} | a_{3} | b_{7} |
| a_{1} | b_{8} | a_{3} | a_{4} |
Reduction: From 3SAT. Given a formula, we build 2 tableaux. The set A will have 1 attribute for each clause and variable in the formula. The clause variables will be distinguished variables. Our T_{1} tableau will be set up as follows:
For each clause, we will set a common undistinguished variable for the 3 variables in the row (so each clause that has that variable will have the same undistinguished variable in that column), and a separate (used only once) undistinguished variable in the other columns.
The T_{2} tableau will have 7 rows for each row in T_{1}. In T_{2} we replace the common undistinguished variables with the 7 sets of constants (0 or 1) that are the ways to set the variables to make the clause true.
They prove a bunch of lemmas after this, but it boils down to: If we have a truth assignment for the formula, we can map that to the tableau by setting the common variables in T_{1} to the truth values in the assignment. All clauses will be true in both T_{1} and T_{2}. If the tableaux are equivalent, then we must have found a way to set those common variables, and that gives us a truth assignment for the formula.
Difficulty: 7. This isn’t that hard of a reduction. Even the lemmas aren’t too hard, though they do depend on a paper’s worth of previous results in equivalences (like the functional dependency thing I did in the example). But there are a ton of definitions to get through before you can start.
The problem: Conjunctive Boolean Query. This is problem SR31 in the appendix.
The description: Given a conjunctive query in the format described last week, is the query true? (That is, can we find any tuples to satisfy the query?)
Examples: Here is the “List all departments that sell pens and pencils” query from last week:
(x). Sales(x, pen) and Sales(x, pencil).
This problem would return true if there was an actual tuple in the database that could bind to x, and false otherwise.
Reduction: The paper by Chandra and Merlin that we used last week has the definition of this problem, but just says the NP-Completeness “follows from, say, the completeness of the clique problem for graphs”.
But I think the reduction is pretty easy to spell out. If we’re given an instance of Clique (a graph G and an integer K), we can just build the query:
“Do there exist k vertices x_{1} through x_{k} such that all edges between any two vertices in the set exist?”
We can implement this as a database by creating a relation holding all edges; the conjunctive query will then have at most O(V^{2}) atoms.
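A sketch of that reduction in Python, treating query evaluation as a brute-force search for a binding of x_{1} through x_{k}; the encoding of the edge relation is my own assumption:

```python
from itertools import permutations

def clique_query_true(edges, k):
    # Evaluate "exists x_1..x_k with Edge(x_i, x_j) for all i < j"
    # by brute force over bindings.  Using permutations forces the
    # bound vertices to be distinct, which a self-loop-free edge
    # relation would force anyway.
    vertices = {v for e in edges for v in e}
    rel = {frozenset(e) for e in edges}
    return any(
        all(frozenset((b[i], b[j])) in rel
            for i in range(k) for j in range(i + 1, k))
        for b in permutations(vertices, k))
```

The query is true exactly when the graph has a k-clique, which is the whole reduction.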
Difficulty: 3, but it’s a lot of work to understand the database definitions to get to the actual problem.
The problem: Conjunctive Query Foldability. This is problem SR30 in the appendix.
The description: I find G&J’s definition very confusing, I think because they don’t have room to define what the various variables mean in terms of databases (they actually send you to the paper, as if they knew it was confusing). The definitions in the paper by Chandra and Merlin that has the reduction do a good job of giving names to things and building up slowly. So:
They define a Relational Data Base as a finite domain D and a set of relations R_{1} through R_{s}. Each relation is a set of ordered tuples of elements of D.
A first-order query is of the form (x_{1}, x_{2}, …, x_{k}). Φ(x_{1}, x_{2}, …, x_{k}), where Φ is a first-order formula whose only free variables are x_{1} through x_{k}.
The result of a query on a database B with domain D is the set of k-tuples {(y_{1}, …, y_{k}) ∈ D^{k} such that Φ(y_{1}, …, y_{k}) is true in B}.
A conjunctive query is a first-order query of the form (x_{1}, …, x_{k}). ∃x_{k+1}, x_{k+2}, …, x_{m}. A_{1} ∧ A_{2} ∧ … ∧ A_{r}, where each A_{i} is an atomic formula asking about a relation in a database, and each element in the relation is a variable from x_{1} to x_{m} or a constant.
Examples of conjunctive queries: Here are some examples from the paper, so we can see how this relates to databases:
“List all departments that sell pens and pencils”. If we have a relation Sales(Department, Item) (“Department” and “Item” are placeholders for variables), the conjunctive query is:
(x). Sales(x, pen) and Sales(x, pencil).
In this case, there are no new variables (or ∃) added in the query, and “pen” and “pencil” are constants.
“List all second-level or higher managers in department K10” (people who are in K10 who manage someone who manages someone else). If we have a relation Emp(Name, Salary, Manager, Dept), the conjunctive query is:
(x_{1}).∃(x_{2}, …, x_{9}) .Emp(x_{2}, x_{3}, x_{4}, x_{5}) ∧Emp(x_{4}, x_{6}, x_{1}, x_{7}) ∧ Emp(x_{1}, x_{8}, x_{9}, K10)
In this query, x_{1} is the “answer”, but depends on the existence of the other x variables. In particular, x_{4}, who is the employee x_{1} manages, and x_{2}, who is the employee x_{4} manages.
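The pen-and-pencil query can be evaluated by brute force, which also makes the semantics of conjunctive queries concrete. In this hypothetical encoding, terms starting with “?” are variables and everything else is a constant; the Sales data is made up:

```python
from itertools import product

def eval_conjunctive(atoms, relations, free_vars, domain):
    # atoms: list of (relation name, argument tuple); arguments that
    # start with "?" are variables, everything else is a constant.
    variables = sorted({t for _, args in atoms
                        for t in args if t.startswith("?")})
    answers = set()
    for values in product(sorted(domain), repeat=len(variables)):
        env = dict(zip(variables, values))
        ground = lambda t: env.get(t, t)   # constants map to themselves
        if all(tuple(map(ground, args)) in relations[name]
               for name, args in atoms):
            answers.add(tuple(env[v] for v in free_vars))
    return answers

# The pen-and-pencil query against a made-up Sales relation.
relations = {"Sales": {("toys", "pen"), ("toys", "pencil"),
                       ("books", "pen")}}
atoms = [("Sales", ("?x", "pen")), ("Sales", ("?x", "pencil"))]
```

Only the department that sells both items survives both atoms, matching the intent of the query.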
Ok, now the actual foldability definition (this is written less formally than either the paper or G&J):
Given two conjunctive queries Q_{1} and Q_{2}, can we create a function σ mapping all constants and variables x_{1} through x_{k} to themselves (and possibly mapping other variables to constants or variables from x_{1} to x_{k} as well), such that if we replace each variable x in Q_{1} with σ(x), we get Q_{2}?
Example: Here is a “worse” version of the first example above:
(x).Sales(x, x_{pen}) ∧ Sales(x,x_{pencil}) ∧ x_{pen}=pen ∧ x_{pencil} = pencil
This can be “folded” into:
(x).Sales(x, pen) ∧ Sales(x,pencil) ∧ pen=pen ∧ pencil = pencil
It still has those redundant equalities at the end, though, and I don’t see a way to use the definitions of foldability to get rid of them. I think that the definition given in Chandra and Merlin’s paper might let you do it, because their function maps the “natural model” of a query, which I think includes relations.
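Applying a candidate folding σ is just term-by-term substitution, which is easy to sketch on the pen/pencil example above. Note that I encode the equality clauses as atoms of a hypothetical “Eq” relation, which is my own shorthand:

```python
def fold(atoms, sigma):
    # Apply a candidate folding sigma to every term of every atom;
    # terms outside sigma's domain map to themselves.
    return [(name, tuple(sigma.get(t, t) for t in args))
            for name, args in atoms]

# The "worse" pen/pencil query, with equalities as "Eq" atoms.
q1 = [("Sales", ("x", "x_pen")), ("Sales", ("x", "x_pencil")),
      ("Eq", ("x_pen", "pen")), ("Eq", ("x_pencil", "pencil"))]
sigma = {"x_pen": "pen", "x_pencil": "pencil"}
```

The hard question in the problem is deciding whether such a σ exists, not applying it.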
Reduction: From Graph Coloring. We start with a graph and an integer K. We build one variable for each vertex, and K additional (separate) variables, in a set called C. The relational model R relates vertices connected by an edge.
Q_{1} is:
Q_{2} is
So what I think we’re trying to do is “fold” the variables in V into the variables in C, so that there is still a relation between each edge. That only works if the vertices on each side of the edge are different colors, because otherwise the relation won’t show up (because of the constraint). But I’m not sure, and the proof in the paper stops after building this construction.
Difficulty: 9. I’m pretty sure a lot of this is going over my head.
The problem: Additional Key. This is problem SR27 in the appendix.
The description: Given a set A of attributes, a set F of functional dependencies, a subset R of A, and a set K of keys on R, can we find an additional key on R for F? (That is, an additional subset R’ of R, not in K, that serves as a key for F.)
Example: This example is from the Beeri and Bernstein paper that had last week’s reduction, and has this week’s as well:
If R = A = {A,B,C}, and K = the single key {A,B}, then {A,C} is an additional key.
Reduction: The Beeri and Bernstein paper uses a construction similar to last week’s. Again, we reduce from Hitting Set.
They build the same construction as in the BCNF violation reduction. Recall that the set was set up so that there was a BCNF violation if there was a hitting set, and that violation happened because some but not all attributes were functionally dependent on the hitting set. So if we add those extra attributes to the set of attributes in the hitting set, we get an additional key.
That’s the basic idea, but the paper has some parts I have trouble following (there are some notation issues I can’t parse, I think). For example, it’s not immediately obvious to me what K is for this problem. I think it’s the set of new attributes, but I’m not sure.
Difficulty: 9. One step harder than the last problem, since it builds on it.
The problem: Boyce-Codd Normal Form Violation. This is problem SR29 in the appendix.
The description: Given a set of attributes A, a collection F of functional dependencies, and a subset A’ of A, does A’ violate Boyce-Codd normal form?
G&J’s definition of BCNF is: A’ violates BCNF if we can find a subset X of A’ and 2 attributes y and z in A’-X such that the pair (X,y) is in the closure of F, and (X,z) is not in the closure of F.
The paper by Beeri and Bernstein defines BCNF as: for all disjoint nonempty sets of attributes X and Y in A, if Y is functionally dependent on X, then all attributes in A depend on X.
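G&J's definition can be checked directly with the standard attribute-closure fixed point. Here is a brute-force Python sketch (exponential in |A'|, since it tries every subset X; the dependency encoding is my own):

```python
from itertools import combinations

def closure(attrs, fds):
    # Fixed-point computation of the attribute closure of attrs
    # under the functional dependencies fds (pairs of (lhs, rhs) sets).
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def violates_bcnf(subset, fds):
    # G&J's test: some X inside the subset functionally determines
    # one attribute of subset - X (y) but not another (z).
    subset = set(subset)
    for r in range(1, len(subset)):
        for X in combinations(subset, r):
            rest = subset - set(X)
            dep = rest & closure(X, fds)
            if dep and rest - dep:
                return True
    return False
```

This makes clear why the decision problem is hard: the closure step is cheap, but the search over candidate X sets is exponential.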
Example: Here is an example from the Beeri and Bernstein paper that has the reduction:
Let A = {Apartment_Type, Address, Size, Landlord, Apartment_Num, Rent, Tenant_Name}
(I’m pretty sure “Apartment_Type” means something like “Two Bedroom”, “Size” is in square feet, “Address” is the street address of the apartment building, at which there are potentially many apartments each with a different “Apt_Num”)
With these functional dependencies:
If A’ is {Tenant_Name, Address, Apartment_Num, Rent}, then BCNF is violated because Rent is functionally dependent on Tenant_Name (through the pair of attributes {Address, Apartment_Num}), but other attributes (for example, Size) are not.
Reduction: Beeri and Bernstein reduce from Hitting Set. So we start with a set S, and a collection C of subsets of S, and an integer K. We need to create a set of attributes and functional dependencies.
Each element in S will be an attribute in A. If we have a hitting set S’ of the attributes, then we could create a BCNF violation by making some of the attributes (but not all) of A functionally dependent on S’.
So we create a new attribute for each subset in C. We create dependencies between the element attribute of each element in each C and its “set” attribute. Now if we have a hitting set S’, then that set is functionally dependent on each element attribute. So then if we make some but not all attributes in A dependent on the set of all element attributes, we get a BCNF violation.
That’s the basic idea, but the actual proof has a lot of special cases to worry about and resolve.
Difficulty: 8. The general idea isn’t too hard (though you have to learn the definition of BCNF to even get started), but there is a ton of details and special cases to worry about to do it right.