Here’s another problem where the G&J definition confused me for a bit.

**The problem: **Expected Retrieval Cost. This is problem SR4 in the appendix.

**The description: **Given a set R of records, each with a probability of being accessed between 0 and 1 (and the sum of all probabilities = 1), some number m of sectors to place records on, and a positive rational K. Can we partition R into m disjoint subsets R_{1}..R_{m} such that:

- The “latency” cost of 2 sectors i and j, called d(i,j), is j-i-1 (if i < j) or m-i+j-1 (if i >= j)
- The probability of a sector, called p(R_{i}), is the sum of the probabilities of the records on that sector
- The sum over all ordered pairs of sectors i and j of p(R_{i}) * p(R_{j}) * d(i,j) is K or less
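To make the definition concrete, here is a small Python sketch of d and the total cost (function and variable names are my own, not from G&J):

```python
def d(i, j, m):
    """Latency (in sectors traversed) to start reading sector j
    immediately after finishing sector i, with m sectors total."""
    return j - i - 1 if i < j else m - i + j - 1

def expected_cost(sector_probs):
    """sector_probs[i-1] = p(R_i); sums p(R_i) * p(R_j) * d(i, j)
    over all ordered pairs of sectors."""
    m = len(sector_probs)
    return sum(sector_probs[i - 1] * sector_probs[j - 1] * d(i, j, m)
               for i in range(1, m + 1) for j in range(1, m + 1))
```

Note the sum is over ordered pairs, since that is how the examples below enumerate the terms.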

**Example: **The thing that was hardest for me to understand was the definition of d. The way it’s written, the distance between 2 adjacent sectors (for example, d(2,3)) is 0, while the distance between a sector and itself (for example, d(2,2)) is m-1. The paper by Cody and Coffman does a better job of explaining the motivation: what we’re looking at is the time (in sectors traversed) for a disk to read sector j after finishing reading sector i. So if we read sector 2 right before reading sector 3, the disk has no traversal time to go from the end of sector 2 to the beginning of sector 3. But if we read sector 2 twice in a row, the disk reader (in this model) needs to scan to the end of all of the sectors, then return to the beginning, then scan all the way to the start of sector 2 to read it again.

So, suppose we have m=2, and 4 records, each with .25 probability. Note that d(1,1) = d(2,2) = 1, while d(1,2) = d(2,1) = 0. If we put all the records in the same sector, say R_{1}, then p(R_{1}) = 1 and p(R_{2}) = 0. So our sum is:

- p(R_{1}) * p(R_{1}) * d(1,1) = 1*1*1 = 1, plus
- p(R_{1}) * p(R_{2}) * d(1,2) = 1*0*0 = 0, plus
- p(R_{2}) * p(R_{1}) * d(2,1) = 0*1*0 = 0, plus
- p(R_{2}) * p(R_{2}) * d(2,2) = 0*0*1 = 0

…for a total of 1.

If we put 2 records in sector 1, and 2 records in sector 2, then p(R_{1}) = p(R_{2}) = .5. So our sum is:

- p(R_{1}) * p(R_{1}) * d(1,1) = .5*.5*1 = .25, plus
- p(R_{1}) * p(R_{2}) * d(1,2) = .5*.5*0 = 0, plus
- p(R_{2}) * p(R_{1}) * d(2,1) = .5*.5*0 = 0, plus
- p(R_{2}) * p(R_{2}) * d(2,2) = .5*.5*1 = .25

…for a total of .5.
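A quick brute-force check (a sketch, with names of my own) confirms that the balanced 2-2 split is the best possible placement of these four records, and the everything-in-one-sector placement is the worst:

```python
from itertools import product

# m = 2, so d(1,1) = d(2,2) = 1 and d(1,2) = d(2,1) = 0, and the
# cost of a placement collapses to p(R1)^2 + p(R2)^2.
def cost(p1, p2):
    return p1 * p1 + p2 * p2

records = [0.25, 0.25, 0.25, 0.25]
costs = set()
for bits in product([0, 1], repeat=4):  # every way to assign 4 records to 2 sectors
    p1 = sum(r for r, b in zip(records, bits) if b == 0)
    costs.add(cost(p1, 1 - p1))

print(min(costs))  # 0.5, from the balanced 2-2 split
print(max(costs))  # 1.0, from putting all records in one sector
```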

**The reduction: ** Hopefully the example using m=2 helps to show why using Partition is a good choice. So we start with a set S of elements. We will turn each element of S into a value between 0 and 1 reflecting its proportion of the sum of all of the elements. For example, if S={1,2,3,4,5}, then we would create a set R of values {1/15, 2/15, 3/15, 4/15, 5/15}. These probabilities will all be between 0 and 1 and will sum to 1.

We will set m=2, K = 1/2. Notice that d(1,2) = d(2,1) = 0, so the only d values that will count for our sum are d(1,1) and d(2,2) (which are both 1). So by our formula we need p(R_{1}) * p(R_{1}) + p(R_{2}) * p(R_{2}) to be 1/2 or less.

Some algebra tells us what this means: since p(R_{1}) + p(R_{2}) = 1, we have p(R_{1})^2 + p(R_{2})^2 = 1 - 2 * p(R_{1}) * p(R_{2}), so the sum is 1/2 or less exactly when p(R_{1}) * p(R_{2}) >= 1/4. But the product of two nonnegative numbers summing to 1 is at most 1/4, with equality only when both are 1/2. So p(R_{1}) = p(R_{2}) = .5. Or, we have an Expected Retrieval Cost solution for R exactly when we have a partition of S.
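As a sanity check, here is a sketch of the reduction in Python (names are my own), together with a brute force over all 2-sector placements. For m = 2 the only nonzero d values are d(1,1) = d(2,2) = 1, so the cost of a placement is p(R_{1})^2 + p(R_{2})^2, and the minimum hits K = 1/2 exactly when S splits into two halves of equal sum:

```python
from fractions import Fraction
from itertools import product

def partition_to_erc(S):
    """Reduction sketch: a Partition instance S becomes an Expected
    Retrieval Cost instance (record probabilities, m sectors, bound K)."""
    total = sum(S)
    probs = [Fraction(x, total) for x in S]  # proportions, summing to 1
    return probs, 2, Fraction(1, 2)          # m = 2, K = 1/2

def min_cost(probs):
    """Brute-force the cost over all assignments of records to 2 sectors.
    With m = 2, d(1,1) = d(2,2) = 1 and d(1,2) = d(2,1) = 0, so the
    cost of a placement is p(R1)^2 + p(R2)^2."""
    best = Fraction(1)
    for bits in product([0, 1], repeat=len(probs)):
        p1 = sum((p for p, b in zip(probs, bits) if b == 0), Fraction(0))
        p2 = 1 - p1
        best = min(best, p1 * p1 + p2 * p2)
    return best
```

For example, S = {3, 1, 1, 2, 2, 1} partitions as 3+2 and 1+1+2+1, and the brute force finds a placement of cost exactly 1/2; S = {1, 2, 4} has odd sum, so no partition exists and every placement costs strictly more than 1/2.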

**Difficulty: **4. Cody and Coffman say the details of the above reduction are “routine” after defining K = 1/2. It is pretty straightforward, but there are some tricky parts to worry about.

I will say though that the definition in G&J, where it’s not clear how distances to adjacent things can be 0, struck me as much harder, and is the reason I dug up the Cody and Coffman paper in the first place. I’d say that definition makes the problem a 6 or 7.