Description
The BitVectors are an ancient and immortal race of 10,000, each with a
10,000 bit genome. The race evolved from a single individual by the
following process: 9,999 times a BitVector chosen at random from
amongst the population was cloned using an error-prone process that
replicates each bit of the genome with 80% fidelity (any particular
bit is flipped from parent to child 20% of the time, independent of
other bits).
Write a program to guess the reproductive history of BitVectors from
their genetic material. The randomly-ordered file
bitvectors-genes.data.gz
contains a 10,000 bit line for each individual. Your program's output
should be, for each input line, the 0-based line number of that
individual's parent, or -1 if it is the progenitor. Balance
performance against probability of mistakes as you see fit.
To help you test your program, here is a much smaller 500 x 500 input
dataset:
bitvectors-genes.data.small.gz,
along with its solution file:
bitvectors-parents.data.small.
(This puzzle description is ©
ITA Software.)
Solution
Before getting started on the problem itself, I need a few utilities
to read the data files and convert the contents to an appropriate
representation.
(defun string->bit-vector (string)
"Convert a STRING of 0s and 1s to a bit vector."
(let ((bits (make-array (length string)
:element-type 'bit
:initial-element 0)))
(dotimes (i (length string) bits)
(setf (bit bits i) (if (char= (char string i) #\0) 0 1)))))
(defun file->bitvector-array (filename)
"Create an array of bit vectors from the data in FILENAME."
(let ((lines (with-open-file (file filename)
(do ((lines (list))
(line (read-line file) (read-line file nil 'eof)))
((eq line 'eof) lines)
(push (string->bit-vector line) lines)))))
(make-array (length lines)
:element-type 'bit-vector
:initial-contents (nreverse lines))))
A quick test of each function shows them to be working as intended:
(defparameter *example-bit-string*
"10100011100110101110110100101110000010110010011000011010011010101001100011000011100111100110101001110001010101000000010100101001100010000111111110111010010010001110100001110101000010011001000000001100001011011010110010110111010101000110101101101100111111101010000110010110011101111100000100100110000001110010100001011111000110111010110111000010100000011111101010010010010011000001000001110101011101010111110100100110010110110101010110100100101010111110000110000000110100010101011110011011000111001011")
(string->bit-vector *example-bit-string*)
#*10100011100110101110110100101110000010110010011000011010011010101001100011000011100111100110101001110001010101000000010100101001100010000111111110111010010010001110100001110101000010011001000000001100001011011010110010110111010101000110101101101100111111101010000110010110011101111100000100100110000001110010100001011111000110111010110111000010100000011111101010010010010011000001000001110101011101010111110100100110010110110101010110100100101010111110000110000000110100010101011110011011000111001011
(defparameter *small-array*
(file->bitvector-array "bitvectors-genes.data.small"))
(array-dimension *small-array* 0)
500
The problem requires understanding how big a difference there is
between two bit strings, so I need to count the number of
non-equivalent bits in two bit vectors. This can be expressed very
cleanly in Common Lisp:
(defun bit-difference (x y)
"Return the number of bits different between bit vectors X and Y."
(count 1 (bit-not (bit-eqv x y))))
(bit-difference #*110011 #*110000)
2
Now it's time to think a little more about the details of the problem.
First a sanity check to make sure this can be solved in the available
memory. The full problem set contains 10,000 strings of 10,000 bits
each for a total of 100 million bits. Even assuming significant
overhead, that doesn't pose a problem for the 2GB on my laptop. I'll
worry about the CPU required once I've got the algorithm defined.
The next step in developing a solution is to determine whether or not
a parent-child relationship exists between two bit vectors. According
to the problem definition, every bit vector other than the original is
created by randomly mutating the elements of an existing bit vector.
The probability of a particular bit changing is 20%. That means that,
on average, the children of a bit vector should be 80% identical to
their parent and the grandchildren should be 64% similar. Of course,
the random element means that some will have more than 80% similarity
and others less. Some children may be more similar to each other than
either is to their parent. The best result any automated solution
will be able to achieve is the most probable relationship tree.
Assuming that a fair random number generator was used, the set
containing the number of bits different between a parent and each of
its children should follow a
binomial distribution.
The probability of two vectors having a parent child relationship is
related to how close the difference between the two is to the expected
value (20%, in this case). In a binomial distribution (which
approximates a continuous normal distribution), 99.73% of the bit
difference values will be within three standard deviations of the
expected value. The standard deviation is computed by:
(defun binomial-standard-deviation (n p)
"Compute the expected binomial standard deviation for N events of
probability P."
(sqrt (* n p (- 1 p))))
Now I can write a predicate to determine if two bit vectors are
closely related. It's important to note that this predicate only
indicates whether or not there is a parent-child relationship between
the two vectors, not the direction of that relationship.
(defconstant +mutation-rate+ 0.2
"The mutation rate when creating a child vector.")
(defun parent-child-p (vector-x vector-y
&optional (mutation-rate +mutation-rate+)
(deviations 3)) ;; 3 = 99.73 percent
"Return true if VECTOR-X and VECTOR-Y have a parent-child relationship.
The relationship is assumed if the bit difference between the two is
less than DEVIATIONS standard deviations from the expected average for
the MUTATION-RATE."
(let* ((size (length vector-x))
(expected-difference (* size mutation-rate))
(standard-deviation (binomial-standard-deviation (length vector-x)
mutation-rate))
(max-difference (+ expected-difference
(* deviations standard-deviation))))
(< (bit-difference vector-x vector-y) max-difference)))
The next question is, if I know all the parent-child relationships,
how do I determine the direction of those relationships? The answer
is iterative:
-
Find all the bit vectors with only a single parent-child
relationship
- Assume that those vectors are the child nodes
- Record that information
- Remove the child node from the parent's list of relationships
- Repeat
First, I need to determine the full set of relationships:
(defun make-related-matrix (bitvectors)
"Create a matrix of bits indicating if two bit vectors in BITVECTORS
have a parent-child relationship."
(let ((matrix (make-array (list (length bitvectors) (length bitvectors))
:element-type 'bit
:initial-element 0)))
(dotimes (i (length bitvectors) matrix)
(dotimes (j (length bitvectors))
(when (and (not (= i j))
(parent-child-p (aref bitvectors i) (aref bitvectors j)))
(setf (aref matrix i j) 1))))))
This is almost the last piece required to find the inheritance
hierarchy. One more small utility I need is something to slice rows
out of an array:
(defun row-view (matrix index)
"Return a vector referencing the values of the INDEXth row in the two
dimensional array MATRIX."
(make-array (second (array-dimensions matrix))
:element-type (array-element-type matrix)
:displaced-to matrix
:displaced-index-offset (array-row-major-index matrix index 0)))
In a fit of premature optimization, I'm going to add a couple of
utilities that will allow me to make only one full pass through the
relationship matrix, after which I'll only look at the rows that were
modified by the last pass.
(defun high-bits (bitvector)
"Return a list containing the index of every 1 bit in BITVECTOR."
(let ((indices (list)))
(dotimes (i (length bitvector) (nreverse indices))
(when (= (aref bitvector i) 1)
(push i indices)))))
(defun integer-list (n)
"Return a list containing the integers from 0 to N - 1."
(let ((integers (list)))
(dotimes (i n (nreverse integers))
(push i integers))))
Finally, the first cut of a full solution:
(defun parents (bitvector-array)
"Return an array in which each element contains the index of the
parent of that element, based on the bit vectors in BITVECTOR-ARRAY."
(let* ((relation-matrix (make-related-matrix bitvector-array))
(bitvector-count (first (array-dimensions relation-matrix)))
(parents (make-array bitvector-count
:element-type 'fixnum
:initial-element -1)))
(do* ((pending (integer-list bitvector-count) changed)
(changed (list) (list)))
((null pending) parents)
(dolist (i pending)
(let* ((bitvector (row-view relation-matrix i))
(ones (high-bits bitvector)))
(when (= 1 (length ones))
(let ((parent (first ones)))
(setf (aref parents i) parent)
(setf (aref relation-matrix parent i) 0)
(push parent changed))))))))
The optimization makes the code slightly more complex, but reduces the
number of iterations significantly. Another optimization that might
be useful is passing in the relationship matrix and making a copy of
it rather than computing it each time.
Let's see how this solution performs. Before running it, I need to
write a couple of more utility functions to load the expected result
from a file and compare two sets of results.
(defun file->integer-vector (filename)
"Create a vector of integers from the data in FILENAME."
(let ((values (with-open-file (file filename)
(do ((values (list))
(value (read-line file) (read-line file nil 'eof)))
((eq value 'eof) values)
(push (parse-integer value) values)))))
(make-array (length values)
:initial-contents (nreverse values))))
(defun integer-vector-differences (vector-x vector-y)
"Return a list of differences between VECTOR-X and VECTOR-Y.
Each element in the list is a list of index, x value, and y value."
(let ((max-index (max (length vector-x) (length vector-y)))
(differences (list)))
(dotimes (i max-index (nreverse differences))
(unless (= (aref vector-x i) (aref vector-y i))
(push (list i (aref vector-x i) (aref vector-y i)) differences)))))
Now for the first run:
(defparameter *small-expected*
(file->integer-vector "bitvectors-parents.data.small"))
(defparameter *small-parents*
(parents *small-array*))
(integer-vector-differences *small-expected* *small-parents*)
((125 284 399) (211 28 -1) (284 -1 125) (347 119 -1) (399 125 -1))
Five differences out of five hundred bit vectors. 99% correct. I'm
done!
Except, two things bother me. The first is that the solution coughed
up more than one root to the inheritance tree. The second is that the
real root wasn't correctly identified. One option for fixing the
first problem is to do a second pass to figure out the which element
of the set of roots is most likely to be the real root and then make
the remaining candidates children of the bit vector most closely
related to each. Before going to that additional effort, though, it
occurs to me that computing the parent child relationship based on
differences up to three standard deviations from the mean identifies
99.73% of potential values. There are 499 such relationships in the
bit vector data, which means that statistically about 1.3 will be
missed. Could that be the source of the extra two roots?
Changing the number of standard deviations to four means that
99.993666% of potential values will be covered. Let's see how that
works:
(integer-vector-differences *small-expected* (parents *small-array*))
((17 420 -1) (92 457 -1) (131 17 -1) (137 385 -1) (337 17 -1) (368 137 -1) (385 284 -1) (417 284 -1) (420 137 -1) (457 284 -1))
Much worse. Okay, I'll bite the bullet and implement a second pass.
First, I need a rule for determining which of several candidates is
most likely the real root node. Based on the problem description, the
real root node is statistically more likely to have more direct
descendants than any other node. If two candidates have the same
number of parent-child relationships, I'll pick the one with the
lowest total when the differences representing those parent-child
relationships are summed. I don't have a rigorous mathematical reason
for this tie-breaking approach, just a gut feel that it won't be
obviously worse than any other.
Since I'm going to need the relationship data for the second pass,
I'll first modify the code to save it rather than recalculate.
(defun parents (bitvector-array
&optional (relation-matrix
(make-related-matrix bitvector-array)))
"Return an array in which each element contains the index of the
parent of that element, based on the bit vectors in BITVECTOR-ARRAY."
(let* ((bitvector-count (first (array-dimensions relation-matrix)))
(parents (make-array bitvector-count
:element-type 'fixnum
:initial-element -1)))
(do* ((pending (integer-list bitvector-count) changed)
(changed (list) (list)))
((null pending) (values parents relation-matrix))
(dolist (i pending)
(let* ((bitvector (row-view relation-matrix i))
(ones (high-bits bitvector)))
(when (= 1 (length ones))
(let ((parent (first ones)))
(setf (aref parents i) parent)
(setf (aref relation-matrix parent i) 0)
(push parent changed))))))))
Now, after calculating the parent string, I'll choose the most likely
root if there isn't exactly one. This requires handling the case
where no root is found as well as when multiple candidates are
returned.
(defun roots (parents)
"Return the indices of any elements in the vector PARENTS that
represent roots of an inheritance hierarchy (-1)."
(let ((roots (list)))
(dotimes (i (length parents) (nreverse roots))
(when (= -1 (aref parents i))
(push i roots)))))
(defun relationship-count (index relationship-matrix)
"Return the number of parent-child relationships for the INDEXth row
in RELATIONSHIP-MATRIX."
(count 1 (row-view relationship-matrix index)))
(defun related-indices (index relationship-matrix)
"Return a list of the indices toggled on in the INDEXth row of
RELATIONSHIP-MATRIX."
(let ((row (row-view relationship-matrix index))
(indices (list)))
(dotimes (i (length row) (nreverse indices))
(when (= 1 (aref row i))
(push i indices)))))
(defun total-distance (index bitvector-array relationship-matrix)
"Compute the sum of the bit differences in the parent-child
relationships associated with the INDEXth row of RELATIONSHIP-MATRIX,
based on the values in BITVECTOR-ARRAY."
(let ((row (aref bitvector-array index)))
(reduce #'+ (mapcar (lambda (x)
(bit-difference row (aref bitvector-array x)))
(related-indices index relationship-matrix)))))
(defun find-root (bitvector-array relationship-matrix
&optional (candidates
(integer-list
(first (array-dimensions relationship-matrix)))))
"Return the index of the most likely root node from the list of
CANDIDATES based on the information in BITVECTOR-ARRAY and
RELATIONSHIP-MATRIX."
(let* ((root (first candidates))
(count (relationship-count root relationship-matrix))
(distance (total-distance root bitvector-array relationship-matrix)))
(dolist (candidate (rest candidates) root)
(let ((candidate-count
(relationship-count candidate relationship-matrix))
(candidate-distance
(total-distance candidate bitvector-array relationship-matrix)))
(when (or (< count candidate-count)
(and (= count candidate-count)
(> distance candidate-distance)))
(setf root candidate)
(setf count candidate-count)
(setf distance candidate-distance))))))
(defun possible-parents (index relationship-matrix)
"Return a list of possible parents for the INDEXth row in
RELATIONSHIP-MATRIX. If no possible parents are listed, return a list
of all indices in RELATIONSHIP-MATRIX except INDEX."
(let ((indices (related-indices index relationship-matrix)))
(if (zerop (length indices))
(remove index (integer-list
(first (array-dimensions relationship-matrix))))
indices)))
(defun find-parent (index bitvector-array relationship-matrix)
"Return the index of the most likely parent of the INDEXth row in
BITVECTOR-ARRAY based on the data in BITVECTOR-ARRAY and
RELATIONSHIP-MATRIX."
(let* ((child (aref bitvector-array index))
(possible-parents (possible-parents index relationship-matrix))
(distances (mapcar (lambda (x)
(list x (bit-difference
child (aref bitvector-array x))))
possible-parents)))
(caar (sort distances #'< :key #'cadr))))
(defun find-parents (indices bitvector-array relationship-matrix)
"Return a list of lists containing the index and it's parent for all
INDICES based on the data in BITVECTOR-ARRAY and RELATIONSHIP-MATRIX."
(let ((relationships (list)))
(dolist (index indices relationships)
(push (list index
(find-parent index bitvector-array relationship-matrix))
relationships))))
(defun update-parents (parents updates)
"Apply the UPDATES to the PARENTS list and return the changed list."
(dolist (update updates parents)
(setf (aref parents (first update)) (second update))))
(defun relationship-tree (bitvector-array)
"Return an array in which each element contains the index of the
parent of that element, based on the bit vectors in BITVECTOR-ARRAY.
Ensure that the tree returned contains exactly one root."
(multiple-value-bind (parents relationship-matrix)
(parents bitvector-array)
(let ((roots (roots parents)))
(cond ((zerop (length roots)) ; No root found.
(setf (aref parents (find-root bitvector-array
relationship-matrix))
-1)
parents)
((> (length roots) 1) ; Too many roots.
(let ((root (find-root bitvector-array
relationship-matrix
roots)))
(update-parents parents
(find-parents (remove root roots)
bitvector-array
relationship-matrix))))
(t parents))))) ; Got it right first try.
Whew! The 80-20 rule (or the 99-1 rule, in this case) is biting me in
the posterior in terms of amount of code. Let's see how this two pass
design works:
(integer-vector-differences *small-expected* (relationship-tree *small-array*))
((125 284 399) (211 28 -1) (284 -1 125) (347 119 58) (399 125 25))
Five differences out of five hundred bit vectors. 99% correct. I'm
done!
Oh, wait, that's the same as without those changes. At least now I
have hooks in case I come up with a better way of identifying the
parents of problem children.
The fact that the real root wasn't found still bothers me. Getting
that one wrong despite getting 99% of the others right suggests that
there may be a bug that only manifests in that one case. I need to
test against other data sets, but this solution description is
getting too long already. I'll run the code against the full data
set and call it a day.
(defparameter *full-array*
(file->bitvector-array "bitvectors-genes.data"))
(defparameter *full-parents*
(relationship-tree *full-array*))
(integer-vector->file *full-parents* "full.vector")
Answer
The root of the inheritance hierarchy is node 334 (counting from index
0). The full result set is available in the
full.vector.gz file. All
source is available in the file
bitvector-genealogy.lisp.
Paths Not Taken
This solution has not been optimized in any way. The computation of
the full data set took several hours. Parts of the puzzle are highly
parallelizable. Modifying parents to use threads could dramatically
improve performance on multi-core machines.
The identification of the root node needs more investigation. This
requires writing a few functions to generate more sample data, which
I'll probably try once I finish up a couple of outstanding projects.
Source Code
Please contact me by
email if you have any
questions, comments, or suggestions.
www.softwarematters.org