Description

The BitVectors are an ancient and immortal race of 10,000, each with a 10,000 bit genome. The race evolved from a single individual by the following process: 9,999 times a BitVector chosen at random from amongst the population was cloned using an error-prone process that replicates each bit of the genome with 80% fidelity (any particular bit is flipped from parent to child 20% of the time, independent of other bits).

Write a program to guess the reproductive history of BitVectors from their genetic material. The randomly-ordered file bitvectors-genes.data.gz contains a 10,000 bit line for each individual. Your program's output should be, for each input line, the 0-based line number of that individual's parent, or -1 if it is the progenitor. Balance performance against probability of mistakes as you see fit.

To help you test your program, here is a much smaller 500 x 500 input dataset: bitvectors-genes.data.small.gz, along with its solution file: bitvectors-parents.data.small.

(This puzzle description is © ITA Software.)

Solution

Before getting started on the problem itself, I need a few utilities to read the data files and convert the contents to an appropriate representation.

(defun string->bit-vector (string) "Convert a STRING of 0s and 1s to a bit vector." (let ((bits (make-array (length string) :element-type 'bit :initial-element 0))) (dotimes (i (length string) bits) (setf (bit bits i) (if (char= (char string i) #\0) 0 1))))) (defun file->bitvector-array (filename) "Create an array of bit vectors from the data in FILENAME." (let ((lines (with-open-file (file filename) (do ((lines (list)) (line (read-line file) (read-line file nil 'eof))) ((eq line 'eof) lines) (push (string->bit-vector line) lines))))) (make-array (length lines) :element-type 'bit-vector :initial-contents (nreverse lines))))

A quick test of each function shows them to be working as intended:

(defparameter *example-bit-string* "10100011100110101110110100101110000010110010011000011010011010101001100011000011100111100110101001110001010101000000010100101001100010000111111110111010010010001110100001110101000010011001000000001100001011011010110010110111010101000110101101101100111111101010000110010110011101111100000100100110000001110010100001011111000110111010110111000010100000011111101010010010010011000001000001110101011101010111110100100110010110110101010110100100101010111110000110000000110100010101011110011011000111001011") (string->bit-vector *example-bit-string*) #*10100011100110101110110100101110000010110010011000011010011010101001100011000011100111100110101001110001010101000000010100101001100010000111111110111010010010001110100001110101000010011001000000001100001011011010110010110111010101000110101101101100111111101010000110010110011101111100000100100110000001110010100001011111000110111010110111000010100000011111101010010010010011000001000001110101011101010111110100100110010110110101010110100100101010111110000110000000110100010101011110011011000111001011 (defparameter *small-array* (file->bitvector-array "bitvectors-genes.data.small")) (array-dimension *small-array* 0) 500

The problem requires understanding how big a difference there is between two bit strings, so I need to count the number of non-equivalent bits in two bit vectors. This can be expressed very cleanly in Common Lisp:

(defun bit-difference (x y) "Return the number of bits different between bit vectors X and Y." (count 1 (bit-not (bit-eqv x y)))) (bit-difference #*110011 #*110000) 2

Now it's time to think a little more about the details of the problem. First a sanity check to make sure this can be solved in the available memory. The full problem set contains 10,000 strings of 10,000 bits each for a total of 100 million bits. Even assuming significant overhead, that doesn't pose a problem for the 2GB on my laptop. I'll worry about the CPU required once I've got the algorithm defined.

The next step in developing a solution is to determine whether or not a parent-child relationship exists between two bit vectors. According to the problem definition, every bit vector other than the original is created by randomly mutating the elements of an existing bit vector. The probability of a particular bit changing is 20%. That means that, on average, the children of a bit vector should be 80% identical to their parent and the grandchildren should be 64% similar. Of course, the random element means that some will have more than 80% similarity and others less. Some children may be more similar to each other than either is to their parent. The best result any automated solution will be able to achieve is the most probable relationship tree.

Assuming that a fair random number generator was used, the set containing the number of bits different between a parent and each of its children should follow a binomial distribution. The probability of two vectors having a parent child relationship is related to how close the difference between the two is to the expected value (20%, in this case). In a binomial distribution (which approximates a continuous normal distribution), 99.73% of the bit difference values will be within three standard deviations of the expected value. The standard deviation is computed by:

(defun binomial-standard-deviation (n p) "Compute the expected binomial standard deviation for N events of probability P." (sqrt (* n p (- 1 p))))

Now I can write a predicate to determine if two bit vectors are closely related. It's important to note that this predicate only indicates whether or not there is a parent-child relationship between the two vectors, not the direction of that relationship.

(defconstant +mutation-rate+ 0.2 "The mutation rate when creating a child vector.") (defun parent-child-p (vector-x vector-y &optional (mutation-rate +mutation-rate+) (deviations 3)) ;; 3 = 99.73 percent "Return true if VECTOR-X and VECTOR-Y have a parent-child relationship. The relationship is assumed if the bit difference between the two is less than DEVIATIONS standard deviations from the expected average for the MUTATION-RATE." (let* ((size (length vector-x)) (expected-difference (* size mutation-rate)) (standard-deviation (binomial-standard-deviation (length vector-x) mutation-rate)) (max-difference (+ expected-difference (* deviations standard-deviation)))) (< (bit-difference vector-x vector-y) max-difference)))

The next question is, if I know all the parent-child relationships, how do I determine the direction of those relationships? The answer is iterative:

First, I need to determine the full set of relationships:

(defun make-related-matrix (bitvectors) "Create a matrix of bits indicating if two bit vectors in BITVECTORS have a parent-child relationship." (let ((matrix (make-array (list (length bitvectors) (length bitvectors)) :element-type 'bit :initial-element 0))) (dotimes (i (length bitvectors) matrix) (dotimes (j (length bitvectors)) (when (and (not (= i j)) (parent-child-p (aref bitvectors i) (aref bitvectors j))) (setf (aref matrix i j) 1))))))

This is almost the last piece required to find the inheritance hierarchy. One more small utility I need is something to slice rows out of an array:

(defun row-view (matrix index) "Return a vector referencing the values of the INDEXth row in the two dimensional array MATRIX." (make-array (second (array-dimensions matrix)) :element-type (array-element-type matrix) :displaced-to matrix :displaced-index-offset (array-row-major-index matrix index 0)))

In a fit of premature optimization, I'm going to add a couple of utilities that will allow me to make only one full pass through the relationship matrix, after which I'll only look at the rows that were modified by the last pass.

(defun high-bits (bitvector) "Return a list containing the index of every 1 bit in BITVECTOR." (let ((indices (list))) (dotimes (i (length bitvector) (nreverse indices)) (when (= (aref bitvector i) 1) (push i indices))))) (defun integer-list (n) "Return a list containing the integers from 0 to N - 1." (let ((integers (list))) (dotimes (i n (nreverse integers)) (push i integers))))

Finally, the first cut of a full solution:

(defun parents (bitvector-array) "Return an array in which each element contains the index of the parent of that element, based on the bit vectors in BITVECTOR-ARRAY." (let* ((relation-matrix (make-related-matrix bitvector-array)) (bitvector-count (first (array-dimensions relation-matrix))) (parents (make-array bitvector-count :element-type 'fixnum :initial-element -1))) (do* ((pending (integer-list bitvector-count) changed) (changed (list) (list))) ((null pending) parents) (dolist (i pending) (let* ((bitvector (row-view relation-matrix i)) (ones (high-bits bitvector))) (when (= 1 (length ones)) (let ((parent (first ones))) (setf (aref parents i) parent) (setf (aref relation-matrix parent i) 0) (push parent changed))))))))

The optimization makes the code slightly more complex, but reduces the number of iterations significantly. Another optimization that might be useful is passing in the relationship matrix and making a copy of it rather than computing it each time.

Let's see how this solution performs. Before running it, I need to write a couple of more utility functions to load the expected result from a file and compare two sets of results.

(defun file->integer-vector (filename) "Create a vector of integers from the data in FILENAME." (let ((values (with-open-file (file filename) (do ((values (list)) (value (read-line file) (read-line file nil 'eof))) ((eq value 'eof) values) (push (parse-integer value) values))))) (make-array (length values) :initial-contents (nreverse values)))) (defun integer-vector-differences (vector-x vector-y) "Return a list of differences between VECTOR-X and VECTOR-Y. Each element in the list is a list of index, x value, and y value." (let ((max-index (max (length vector-x) (length vector-y))) (differences (list))) (dotimes (i max-index (nreverse differences)) (unless (= (aref vector-x i) (aref vector-y i)) (push (list i (aref vector-x i) (aref vector-y i)) differences)))))

Now for the first run:

(defparameter *small-expected* (file->integer-vector "bitvectors-parents.data.small")) (defparameter *small-parents* (parents *small-array*)) (integer-vector-differences *small-expected* *small-parents*) ((125 284 399) (211 28 -1) (284 -1 125) (347 119 -1) (399 125 -1))

Five differences out of five hundred bit vectors. 99% correct. I'm done!

Except, two things bother me. The first is that the solution coughed up more than one root to the inheritance tree. The second is that the real root wasn't correctly identified. One option for fixing the first problem is to do a second pass to figure out the which element of the set of roots is most likely to be the real root and then make the remaining candidates children of the bit vector most closely related to each. Before going to that additional effort, though, it occurs to me that computing the parent child relationship based on differences up to three standard deviations from the mean identifies 99.73% of potential values. There are 499 such relationships in the bit vector data, which means that statistically about 1.3 will be missed. Could that be the source of the extra two roots?

Changing the number of standard deviations to four means that 99.993666% of potential values will be covered. Let's see how that works:

(integer-vector-differences *small-expected* (parents *small-array*)) ((17 420 -1) (92 457 -1) (131 17 -1) (137 385 -1) (337 17 -1) (368 137 -1) (385 284 -1) (417 284 -1) (420 137 -1) (457 284 -1))

Much worse. Okay, I'll bite the bullet and implement a second pass. First, I need a rule for determining which of several candidates is most likely the real root node. Based on the problem description, the real root node is statistically more likely to have more direct descendants than any other node. If two candidates have the same number of parent-child relationships, I'll pick the one with the lowest total when the differences representing those parent-child relationships are summed. I don't have a rigorous mathematical reason for this tie-breaking approach, just a gut feel that it won't be obviously worse than any other.

Since I'm going to need the relationship data for the second pass, I'll first modify the code to save it rather than recalculate.

(defun parents (bitvector-array &optional (relation-matrix (make-related-matrix bitvector-array))) "Return an array in which each element contains the index of the parent of that element, based on the bit vectors in BITVECTOR-ARRAY." (let* ((bitvector-count (first (array-dimensions relation-matrix))) (parents (make-array bitvector-count :element-type 'fixnum :initial-element -1))) (do* ((pending (integer-list bitvector-count) changed) (changed (list) (list))) ((null pending) (values parents relation-matrix)) (dolist (i pending) (let* ((bitvector (row-view relation-matrix i)) (ones (high-bits bitvector))) (when (= 1 (length ones)) (let ((parent (first ones))) (setf (aref parents i) parent) (setf (aref relation-matrix parent i) 0) (push parent changed))))))))

Now, after calculating the parent string, I'll choose the most likely root if there isn't exactly one. This requires handling the case where no root is found as well as when multiple candidates are returned.

(defun roots (parents) "Return the indices of any elements in the vector PARENTS that represent roots of an inheritance hierarchy (-1)." (let ((roots (list))) (dotimes (i (length parents) (nreverse roots)) (when (= -1 (aref parents i)) (push i roots))))) (defun relationship-count (index relationship-matrix) "Return the number of parent-child relationships for the INDEXth row in RELATIONSHIP-MATRIX." (count 1 (row-view relationship-matrix index))) (defun related-indices (index relationship-matrix) "Return a list of the indices toggled on in the INDEXth row of RELATIONSHIP-MATRIX." (let ((row (row-view relationship-matrix index)) (indices (list))) (dotimes (i (length row) (nreverse indices)) (when (= 1 (aref row i)) (push i indices))))) (defun total-distance (index bitvector-array relationship-matrix) "Compute the sum of the bit differences in the parent-child relationships associated with the INDEXth row of RELATIONSHIP-MATRIX, based on the values in BITVECTOR-ARRAY." (let ((row (aref bitvector-array index))) (reduce #'+ (mapcar (lambda (x) (bit-difference row (aref bitvector-array x))) (related-indices index relationship-matrix))))) (defun find-root (bitvector-array relationship-matrix &optional (candidates (integer-list (first (array-dimensions relationship-matrix))))) "Return the index of the most likely root node from the list of CANDIDATES based on the information in BITVECTOR-ARRAY and RELATIONSHIP-MATRIX." (let* ((root (first candidates)) (count (relationship-count root relationship-matrix)) (distance (total-distance root bitvector-array relationship-matrix))) (dolist (candidate (rest candidates) root) (let ((candidate-count (relationship-count candidate relationship-matrix)) (candidate-distance (total-distance candidate bitvector-array relationship-matrix))) (when (or (< count candidate-count) (and (= count candidate-count) (> distance candidate-distance))) (setf root candidate) (setf count candidate-count) (setf distance candidate-distance)))))) (defun possible-parents (index relationship-matrix) "Return a list of possible parents for the INDEXth row in RELATIONSHIP-MATRIX. If no possible parents are listed, return a list of all indices in RELATIONSHIP-MATRIX except INDEX." (let ((indices (related-indices index relationship-matrix))) (if (zerop (length indices)) (remove index (integer-list (first (array-dimensions relationship-matrix)))) indices))) (defun find-parent (index bitvector-array relationship-matrix) "Return the index of the most likely parent of the INDEXth row in BITVECTOR-ARRAY based on the data in BITVECTOR-ARRAY and RELATIONSHIP-MATRIX." (let* ((child (aref bitvector-array index)) (possible-parents (possible-parents index relationship-matrix)) (distances (mapcar (lambda (x) (list x (bit-difference child (aref bitvector-array x)))) possible-parents))) (caar (sort distances #'< :key #'cadr)))) (defun find-parents (indices bitvector-array relationship-matrix) "Return a list of lists containing the index and it's parent for all INDICES based on the data in BITVECTOR-ARRAY and RELATIONSHIP-MATRIX." (let ((relationships (list))) (dolist (index indices relationships) (push (list index (find-parent index bitvector-array relationship-matrix)) relationships)))) (defun update-parents (parents updates) "Apply the UPDATES to the PARENTS list and return the changed list." (dolist (update updates parents) (setf (aref parents (first update)) (second update)))) (defun relationship-tree (bitvector-array) "Return an array in which each element contains the index of the parent of that element, based on the bit vectors in BITVECTOR-ARRAY. Ensure that the tree returned contains exactly one root." (multiple-value-bind (parents relationship-matrix) (parents bitvector-array) (let ((roots (roots parents))) (cond ((zerop (length roots)) ; No root found. (setf (aref parents (find-root bitvector-array relationship-matrix)) -1) parents) ((> (length roots) 1) ; Too many roots. (let ((root (find-root bitvector-array relationship-matrix roots))) (update-parents parents (find-parents (remove root roots) bitvector-array relationship-matrix)))) (t parents))))) ; Got it right first try.

Whew! The 80-20 rule (or the 99-1 rule, in this case) is biting me in the posterior in terms of amount of code. Let's see how this two pass design works:

(integer-vector-differences *small-expected* (relationship-tree *small-array*)) ((125 284 399) (211 28 -1) (284 -1 125) (347 119 58) (399 125 25))

Five differences out of five hundred bit vectors. 99% correct. I'm done!

Oh, wait, that's the same as without those changes. At least now I have hooks in case I come up with a better way of identifying the parents of problem children.

The fact that the real root wasn't found still bothers me. Getting that one wrong despite getting 99% of the others right suggests that there may be a bug that only manifests in that one case. I need to test against other data sets, but this solution description is getting too long already. I'll run the code against the full data set and call it a day.

(defparameter *full-array* (file->bitvector-array "bitvectors-genes.data")) (defparameter *full-parents* (relationship-tree *full-array*)) (integer-vector->file *full-parents* "full.vector")

Answer

The root of the inheritance hierarchy is node 334 (counting from index 0). The full result set is available in the full.vector.gz file. All source is available in the file bitvector-genealogy.lisp.

Paths Not Taken

This solution has not been optimized in any way. The computation of the full data set took several hours. Parts of the puzzle are highly parallelizable. Modifying parents to use threads could dramatically improve performance on multi-core machines.

The identification of the root node needs more investigation. This requires writing a few functions to generate more sample data, which I'll probably try once I finish up a couple of outstanding projects.

Source Code

Please contact me by email if you have any questions, comments, or suggestions.

www.softwarematters.org