Description
The BitVectors are an ancient and immortal race of 10,000, each with a
10,000 bit genome. The race evolved from a single individual by the
following process: 9,999 times a BitVector chosen at random from
amongst the population was cloned using an errorprone process that
replicates each bit of the genome with 80% fidelity (any particular
bit is flipped from parent to child 20% of the time, independent of
other bits).
Write a program to guess the reproductive history of BitVectors from
their genetic material. The randomlyordered file
bitvectorsgenes.data.gz
contains a 10,000 bit line for each individual. Your program's output
should be, for each input line, the 0based line number of that
individual's parent, or 1 if it is the progenitor. Balance
performance against probability of mistakes as you see fit.
To help you test your program, here is a much smaller 500 x 500 input
dataset:
bitvectorsgenes.data.small.gz,
along with its solution file:
bitvectorsparents.data.small.
(This puzzle description is ©
ITA Software.)
Solution
Before getting started on the problem itself, I need a few utilities
to read the data files and convert the contents to an appropriate
representation.
(defun string>bitvector (string)
"Convert a STRING of 0s and 1s to a bit vector."
(let ((bits (makearray (length string)
:elementtype 'bit
:initialelement 0)))
(dotimes (i (length string) bits)
(setf (bit bits i) (if (char= (char string i) #\0) 0 1)))))
(defun file>bitvectorarray (filename)
"Create an array of bit vectors from the data in FILENAME."
(let ((lines (withopenfile (file filename)
(do ((lines (list))
(line (readline file) (readline file nil 'eof)))
((eq line 'eof) lines)
(push (string>bitvector line) lines)))))
(makearray (length lines)
:elementtype 'bitvector
:initialcontents (nreverse lines))))
A quick test of each function shows them to be working as intended:
(defparameter *examplebitstring*
"10100011100110101110110100101110000010110010011000011010011010101001100011000011100111100110101001110001010101000000010100101001100010000111111110111010010010001110100001110101000010011001000000001100001011011010110010110111010101000110101101101100111111101010000110010110011101111100000100100110000001110010100001011111000110111010110111000010100000011111101010010010010011000001000001110101011101010111110100100110010110110101010110100100101010111110000110000000110100010101011110011011000111001011")
(string>bitvector *examplebitstring*)
#*10100011100110101110110100101110000010110010011000011010011010101001100011000011100111100110101001110001010101000000010100101001100010000111111110111010010010001110100001110101000010011001000000001100001011011010110010110111010101000110101101101100111111101010000110010110011101111100000100100110000001110010100001011111000110111010110111000010100000011111101010010010010011000001000001110101011101010111110100100110010110110101010110100100101010111110000110000000110100010101011110011011000111001011
(defparameter *smallarray*
(file>bitvectorarray "bitvectorsgenes.data.small"))
(arraydimension *smallarray* 0)
500
The problem requires understanding how big a difference there is
between two bit strings, so I need to count the number of
nonequivalent bits in two bit vectors. This can be expressed very
cleanly in Common Lisp:
(defun bitdifference (x y)
"Return the number of bits different between bit vectors X and Y."
(count 1 (bitnot (biteqv x y))))
(bitdifference #*110011 #*110000)
2
Now it's time to think a little more about the details of the problem.
First a sanity check to make sure this can be solved in the available
memory. The full problem set contains 10,000 strings of 10,000 bits
each for a total of 100 million bits. Even assuming significant
overhead, that doesn't pose a problem for the 2GB on my laptop. I'll
worry about the CPU required once I've got the algorithm defined.
The next step in developing a solution is to determine whether or not
a parentchild relationship exists between two bit vectors. According
to the problem definition, every bit vector other than the original is
created by randomly mutating the elements of an existing bit vector.
The probability of a particular bit changing is 20%. That means that,
on average, the children of a bit vector should be 80% identical to
their parent and the grandchildren should be 64% similar. Of course,
the random element means that some will have more than 80% similarity
and others less. Some children may be more similar to each other than
either is to their parent. The best result any automated solution
will be able to achieve is the most probable relationship tree.
Assuming that a fair random number generator was used, the set
containing the number of bits different between a parent and each of
its children should follow a
binomial distribution.
The probability of two vectors having a parent child relationship is
related to how close the difference between the two is to the expected
value (20%, in this case). In a binomial distribution (which
approximates a continuous normal distribution), 99.73% of the bit
difference values will be within three standard deviations of the
expected value. The standard deviation is computed by:
(defun binomialstandarddeviation (n p)
"Compute the expected binomial standard deviation for N events of
probability P."
(sqrt (* n p ( 1 p))))
Now I can write a predicate to determine if two bit vectors are
closely related. It's important to note that this predicate only
indicates whether or not there is a parentchild relationship between
the two vectors, not the direction of that relationship.
(defconstant +mutationrate+ 0.2
"The mutation rate when creating a child vector.")
(defun parentchildp (vectorx vectory
&optional (mutationrate +mutationrate+)
(deviations 3)) ;; 3 = 99.73 percent
"Return true if VECTORX and VECTORY have a parentchild relationship.
The relationship is assumed if the bit difference between the two is
less than DEVIATIONS standard deviations from the expected average for
the MUTATIONRATE."
(let* ((size (length vectorx))
(expecteddifference (* size mutationrate))
(standarddeviation (binomialstandarddeviation (length vectorx)
mutationrate))
(maxdifference (+ expecteddifference
(* deviations standarddeviation))))
(< (bitdifference vectorx vectory) maxdifference)))
The next question is, if I know all the parentchild relationships,
how do I determine the direction of those relationships? The answer
is iterative:

Find all the bit vectors with only a single parentchild
relationship
 Assume that those vectors are the child nodes
 Record that information
 Remove the child node from the parent's list of relationships
 Repeat
First, I need to determine the full set of relationships:
(defun makerelatedmatrix (bitvectors)
"Create a matrix of bits indicating if two bit vectors in BITVECTORS
have a parentchild relationship."
(let ((matrix (makearray (list (length bitvectors) (length bitvectors))
:elementtype 'bit
:initialelement 0)))
(dotimes (i (length bitvectors) matrix)
(dotimes (j (length bitvectors))
(when (and (not (= i j))
(parentchildp (aref bitvectors i) (aref bitvectors j)))
(setf (aref matrix i j) 1))))))
This is almost the last piece required to find the inheritance
hierarchy. One more small utility I need is something to slice rows
out of an array:
(defun rowview (matrix index)
"Return a vector referencing the values of the INDEXth row in the two
dimensional array MATRIX."
(makearray (second (arraydimensions matrix))
:elementtype (arrayelementtype matrix)
:displacedto matrix
:displacedindexoffset (arrayrowmajorindex matrix index 0)))
In a fit of premature optimization, I'm going to add a couple of
utilities that will allow me to make only one full pass through the
relationship matrix, after which I'll only look at the rows that were
modified by the last pass.
(defun highbits (bitvector)
"Return a list containing the index of every 1 bit in BITVECTOR."
(let ((indices (list)))
(dotimes (i (length bitvector) (nreverse indices))
(when (= (aref bitvector i) 1)
(push i indices)))))
(defun integerlist (n)
"Return a list containing the integers from 0 to N  1."
(let ((integers (list)))
(dotimes (i n (nreverse integers))
(push i integers))))
Finally, the first cut of a full solution:
(defun parents (bitvectorarray)
"Return an array in which each element contains the index of the
parent of that element, based on the bit vectors in BITVECTORARRAY."
(let* ((relationmatrix (makerelatedmatrix bitvectorarray))
(bitvectorcount (first (arraydimensions relationmatrix)))
(parents (makearray bitvectorcount
:elementtype 'fixnum
:initialelement 1)))
(do* ((pending (integerlist bitvectorcount) changed)
(changed (list) (list)))
((null pending) parents)
(dolist (i pending)
(let* ((bitvector (rowview relationmatrix i))
(ones (highbits bitvector)))
(when (= 1 (length ones))
(let ((parent (first ones)))
(setf (aref parents i) parent)
(setf (aref relationmatrix parent i) 0)
(push parent changed))))))))
The optimization makes the code slightly more complex, but reduces the
number of iterations significantly. Another optimization that might
be useful is passing in the relationship matrix and making a copy of
it rather than computing it each time.
Let's see how this solution performs. Before running it, I need to
write a couple of more utility functions to load the expected result
from a file and compare two sets of results.
(defun file>integervector (filename)
"Create a vector of integers from the data in FILENAME."
(let ((values (withopenfile (file filename)
(do ((values (list))
(value (readline file) (readline file nil 'eof)))
((eq value 'eof) values)
(push (parseinteger value) values)))))
(makearray (length values)
:initialcontents (nreverse values))))
(defun integervectordifferences (vectorx vectory)
"Return a list of differences between VECTORX and VECTORY.
Each element in the list is a list of index, x value, and y value."
(let ((maxindex (max (length vectorx) (length vectory)))
(differences (list)))
(dotimes (i maxindex (nreverse differences))
(unless (= (aref vectorx i) (aref vectory i))
(push (list i (aref vectorx i) (aref vectory i)) differences)))))
Now for the first run:
(defparameter *smallexpected*
(file>integervector "bitvectorsparents.data.small"))
(defparameter *smallparents*
(parents *smallarray*))
(integervectordifferences *smallexpected* *smallparents*)
((125 284 399) (211 28 1) (284 1 125) (347 119 1) (399 125 1))
Five differences out of five hundred bit vectors. 99% correct. I'm
done!
Except, two things bother me. The first is that the solution coughed
up more than one root to the inheritance tree. The second is that the
real root wasn't correctly identified. One option for fixing the
first problem is to do a second pass to figure out the which element
of the set of roots is most likely to be the real root and then make
the remaining candidates children of the bit vector most closely
related to each. Before going to that additional effort, though, it
occurs to me that computing the parent child relationship based on
differences up to three standard deviations from the mean identifies
99.73% of potential values. There are 499 such relationships in the
bit vector data, which means that statistically about 1.3 will be
missed. Could that be the source of the extra two roots?
Changing the number of standard deviations to four means that
99.993666% of potential values will be covered. Let's see how that
works:
(integervectordifferences *smallexpected* (parents *smallarray*))
((17 420 1) (92 457 1) (131 17 1) (137 385 1) (337 17 1) (368 137 1) (385 284 1) (417 284 1) (420 137 1) (457 284 1))
Much worse. Okay, I'll bite the bullet and implement a second pass.
First, I need a rule for determining which of several candidates is
most likely the real root node. Based on the problem description, the
real root node is statistically more likely to have more direct
descendants than any other node. If two candidates have the same
number of parentchild relationships, I'll pick the one with the
lowest total when the differences representing those parentchild
relationships are summed. I don't have a rigorous mathematical reason
for this tiebreaking approach, just a gut feel that it won't be
obviously worse than any other.
Since I'm going to need the relationship data for the second pass,
I'll first modify the code to save it rather than recalculate.
(defun parents (bitvectorarray
&optional (relationmatrix
(makerelatedmatrix bitvectorarray)))
"Return an array in which each element contains the index of the
parent of that element, based on the bit vectors in BITVECTORARRAY."
(let* ((bitvectorcount (first (arraydimensions relationmatrix)))
(parents (makearray bitvectorcount
:elementtype 'fixnum
:initialelement 1)))
(do* ((pending (integerlist bitvectorcount) changed)
(changed (list) (list)))
((null pending) (values parents relationmatrix))
(dolist (i pending)
(let* ((bitvector (rowview relationmatrix i))
(ones (highbits bitvector)))
(when (= 1 (length ones))
(let ((parent (first ones)))
(setf (aref parents i) parent)
(setf (aref relationmatrix parent i) 0)
(push parent changed))))))))
Now, after calculating the parent string, I'll choose the most likely
root if there isn't exactly one. This requires handling the case
where no root is found as well as when multiple candidates are
returned.
(defun roots (parents)
"Return the indices of any elements in the vector PARENTS that
represent roots of an inheritance hierarchy (1)."
(let ((roots (list)))
(dotimes (i (length parents) (nreverse roots))
(when (= 1 (aref parents i))
(push i roots)))))
(defun relationshipcount (index relationshipmatrix)
"Return the number of parentchild relationships for the INDEXth row
in RELATIONSHIPMATRIX."
(count 1 (rowview relationshipmatrix index)))
(defun relatedindices (index relationshipmatrix)
"Return a list of the indices toggled on in the INDEXth row of
RELATIONSHIPMATRIX."
(let ((row (rowview relationshipmatrix index))
(indices (list)))
(dotimes (i (length row) (nreverse indices))
(when (= 1 (aref row i))
(push i indices)))))
(defun totaldistance (index bitvectorarray relationshipmatrix)
"Compute the sum of the bit differences in the parentchild
relationships associated with the INDEXth row of RELATIONSHIPMATRIX,
based on the values in BITVECTORARRAY."
(let ((row (aref bitvectorarray index)))
(reduce #'+ (mapcar (lambda (x)
(bitdifference row (aref bitvectorarray x)))
(relatedindices index relationshipmatrix)))))
(defun findroot (bitvectorarray relationshipmatrix
&optional (candidates
(integerlist
(first (arraydimensions relationshipmatrix)))))
"Return the index of the most likely root node from the list of
CANDIDATES based on the information in BITVECTORARRAY and
RELATIONSHIPMATRIX."
(let* ((root (first candidates))
(count (relationshipcount root relationshipmatrix))
(distance (totaldistance root bitvectorarray relationshipmatrix)))
(dolist (candidate (rest candidates) root)
(let ((candidatecount
(relationshipcount candidate relationshipmatrix))
(candidatedistance
(totaldistance candidate bitvectorarray relationshipmatrix)))
(when (or (< count candidatecount)
(and (= count candidatecount)
(> distance candidatedistance)))
(setf root candidate)
(setf count candidatecount)
(setf distance candidatedistance))))))
(defun possibleparents (index relationshipmatrix)
"Return a list of possible parents for the INDEXth row in
RELATIONSHIPMATRIX. If no possible parents are listed, return a list
of all indices in RELATIONSHIPMATRIX except INDEX."
(let ((indices (relatedindices index relationshipmatrix)))
(if (zerop (length indices))
(remove index (integerlist
(first (arraydimensions relationshipmatrix))))
indices)))
(defun findparent (index bitvectorarray relationshipmatrix)
"Return the index of the most likely parent of the INDEXth row in
BITVECTORARRAY based on the data in BITVECTORARRAY and
RELATIONSHIPMATRIX."
(let* ((child (aref bitvectorarray index))
(possibleparents (possibleparents index relationshipmatrix))
(distances (mapcar (lambda (x)
(list x (bitdifference
child (aref bitvectorarray x))))
possibleparents)))
(caar (sort distances #'< :key #'cadr))))
(defun findparents (indices bitvectorarray relationshipmatrix)
"Return a list of lists containing the index and it's parent for all
INDICES based on the data in BITVECTORARRAY and RELATIONSHIPMATRIX."
(let ((relationships (list)))
(dolist (index indices relationships)
(push (list index
(findparent index bitvectorarray relationshipmatrix))
relationships))))
(defun updateparents (parents updates)
"Apply the UPDATES to the PARENTS list and return the changed list."
(dolist (update updates parents)
(setf (aref parents (first update)) (second update))))
(defun relationshiptree (bitvectorarray)
"Return an array in which each element contains the index of the
parent of that element, based on the bit vectors in BITVECTORARRAY.
Ensure that the tree returned contains exactly one root."
(multiplevaluebind (parents relationshipmatrix)
(parents bitvectorarray)
(let ((roots (roots parents)))
(cond ((zerop (length roots)) ; No root found.
(setf (aref parents (findroot bitvectorarray
relationshipmatrix))
1)
parents)
((> (length roots) 1) ; Too many roots.
(let ((root (findroot bitvectorarray
relationshipmatrix
roots)))
(updateparents parents
(findparents (remove root roots)
bitvectorarray
relationshipmatrix))))
(t parents))))) ; Got it right first try.
Whew! The 8020 rule (or the 991 rule, in this case) is biting me in
the posterior in terms of amount of code. Let's see how this two pass
design works:
(integervectordifferences *smallexpected* (relationshiptree *smallarray*))
((125 284 399) (211 28 1) (284 1 125) (347 119 58) (399 125 25))
Five differences out of five hundred bit vectors. 99% correct. I'm
done!
Oh, wait, that's the same as without those changes. At least now I
have hooks in case I come up with a better way of identifying the
parents of problem children.
The fact that the real root wasn't found still bothers me. Getting
that one wrong despite getting 99% of the others right suggests that
there may be a bug that only manifests in that one case. I need to
test against other data sets, but this solution description is
getting too long already. I'll run the code against the full data
set and call it a day.
(defparameter *fullarray*
(file>bitvectorarray "bitvectorsgenes.data"))
(defparameter *fullparents*
(relationshiptree *fullarray*))
(integervector>file *fullparents* "full.vector")
Answer
The root of the inheritance hierarchy is node 334 (counting from index
0). The full result set is available in the
full.vector.gz file. All
source is available in the file
bitvectorgenealogy.lisp.
Paths Not Taken
This solution has not been optimized in any way. The computation of
the full data set took several hours. Parts of the puzzle are highly
parallelizable. Modifying parents to use threads could dramatically
improve performance on multicore machines.
The identification of the root node needs more investigation. This
requires writing a few functions to generate more sample data, which
I'll probably try once I finish up a couple of outstanding projects.
Source Code
Please contact me by
email if you have any
questions, comments, or suggestions.
www.softwarematters.org