Description
"If the integers from 1 to 999,999,999 are written as words, sorted
alphabetically, and concatenated, what is the 51 billionth letter?"
To be precise: if the integers from 1 to 999,999,999 are expressed in
words (omitting spaces, 'and', and punctuation[1]), and sorted
alphabetically so that the first six integers are
- eight
- eighteen
- eighteenmillion
- eighteenmillioneight
- eighteenmillioneighteen
- eighteenmillioneighteenthousand
and the last is
twothousandtwohundredtwo
then reading top to bottom, left to right, the 28th letter completes
the spelling of the integer "eighteenmillion".
The 51 billionth letter also completes the spelling of an
integer. Which one, and what is the sum of all the integers to that
point?
[1] For example, 911,610,034 is written
"ninehundredelevenmillionsixhundredtenthousandthirtyfour"; 500,000,000
is written "fivehundredmillion"; 1,709 is written
"onethousandsevenhundrednine".
(This puzzle description is ©
ITA Software.)
Solution
This problem is interesting because it has an ugly, brute force
solution but begs for a more considered approach. The brute force
solution is to generate all of the words, write them to a file, sort
them, and start counting. In keeping with the lack of elegance, that
approach should be implemented in Perl. Obviously this technique will
require at least 51 gigabytes of storage, based solely on the problem
description. Disk space is cheap, but there's got to be a better way.
The 51 gigabyte minimum implied by the problem description prohibits
running the brute force approach in memory. To get a rough idea of
how much memory is required to hold all of the words, the nine digit
number 123,456,789 is expressed in English (without spaces and
hyphens) as:
onehundredtwentythreemillionfourhundredfiftysixthousandsevenhundredeightynine
That is 77 characters, which suggests that the full set of words will
require the better part of 100 gigabytes.
That string was easily computed using Common Lisp, the first
programming language that comes to mind when
elegance is
required. The function used is:
(defun stripped-number-word (i)
"Return the English representation of I with spaces and hyphens removed."
(remove-if-not #'alpha-char-p (format nil "~R" i)))
Since storing all of the words is not a reasonable option, the
solution will need to trade CPU cycles for space. Fortunately, the
problem provides a simple ordering. It isn't necessary to store all
the numbers and words, only the total length of the words and the sum
of the numbers who's corresponding words appear alphabetically before
the one containing the 51 billionth letter. Breaking the search space
up into multiple alphabetically ordered pieces will allow the problem
to be solved by a repeated divide-and-conquer approach.
First, a little exploratory coding. A histogram of number word stems
and the total length for all words with those stems will identify
where the 51 billionth letter is hiding. My first cut of a histogram
function is:
(defun number-word-histogram (max-value stem-size)
"For all integers from 1 to MAX-VALUE, return the number of integers,
the total length of the English words for their values, and their sum
for those whose words are the same up to STEM-SIZE."
(flet ((stem-tally (stem histogram)
(multiple-value-bind (tally exists) (gethash stem histogram)
(if exists
tally
(list 0 0 0)))))
(let ((histogram (make-hash-table :test 'equal)))
(dotimes (i max-value histogram)
(let* ((word (stripped-number-word (1+ i)))
(stem (subseq word 0 (min (length word) stem-size)))
(tally (stem-tally stem histogram)))
(setf (gethash stem histogram)
(list (1+ (first tally))
(+ (second tally) (length word))
(+ (third tally) (1+ i)))))))))
This function creates a hash table with keys that are word stems and
values that are lists of interesting information about the numbers and
words associated with that stem. In addition to the total length of
all words beginning with the stem, number-word-histogram
also tracks the sum of all numbers corresponding to those words and
the number of words with that stem. The number of words is useful as
a sanity check during testing.
A simple test harness shows this function working correctly:
(defun test-histogram (max-value stem-size)
(let ((totals (list 0 0 0)))
(maphash (lambda (key value)
(setf totals (mapcar #'+ totals value))
(format t "~S ~S~&" key value))
(number-word-histogram max-value stem-size))
totals))
(test-histogram 10000 4)
"sixt" (1011 29533 6500161)
"twot" (1000 29440 2499500)
"six" (1 3 6)
"ten" (1 3 10)
"eigh" (1112 33603 8585321)
"twel" (1 6 12)
"twen" (10 96 245)
"thre" (1101 33499 3534453)
"thir" (11 104 358)
"seve" (1112 33614 7575219)
"two" (1 3 2)
"one" (1 3 1)
"fift" (11 93 560)
"oneh" (100 1854 14950)
"five" (1101 32398 5554455)
"fort" (10 86 445)
"tent" (1 11 10000)
"onet" (1000 29440 1499500)
"elev" (1 6 11)
"four" (1102 32406 4544468)
"sixh" (100 1854 64950)
"nine" (1112 32502 9595423)
"twoh" (100 1854 24950)
(10000 292411 50005000)
Working correctly and performing well are two different things.
Timing the performance of number-word-histogram on my
laptop gives the following numbers:
? (time (number-word-histogram 10000 4))
(NUMBER-WORD-HISTOGRAM 10000 4) took 553 milliseconds (0.553 seconds) to run
with 1 available CPU core.
During that period, 445 milliseconds (0.445 seconds) were spent in user mode
16 milliseconds (0.016 seconds) were spent in system mode
32 milliseconds (0.032 seconds) was spent in GC.
5,033,584 bytes of memory allocated.
#
? (time (number-word-histogram 100000 5))
(NUMBER-WORD-HISTOGRAM 100000 5) took 6,230 milliseconds (6.230 seconds) to run
with 1 available CPU core.
During that period, 4,513 milliseconds (4.513 seconds) were spent in user mode
170 milliseconds (0.170 seconds) were spent in system mode
152 milliseconds (0.152 seconds) was spent in GC.
55,418,672 bytes of memory allocated.
#
? (time (number-word-histogram 1000000 6))
(NUMBER-WORD-HISTOGRAM 1000000 6) took 76,564 milliseconds (76.564 seconds) to run
with 1 available CPU core.
During that period, 48,774 milliseconds (48.774 seconds) were spent in user mode
1,879 milliseconds (1.879 seconds) were spent in system mode
2,176 milliseconds (2.176 seconds) was spent in GC.
661,388,344 bytes of memory allocated.
#
Extrapolating from the last result suggests that computing the
histogram up to 999,999,999 will take anywhere from 14 to 21 hours on
my laptop. Fortunately, I have access to a bigger machine:
* (time (number-word-histogram 1000000 6))
Evaluation took:
9.766 seconds of real time
9.760517 seconds of total run time (9.716523 user, 0.043994 system)
[ Run times consist of 0.208 seconds GC time, and 9.553 seconds non-GC time. ]
99.95% CPU
24,295,513,567 processor cycles
1,476,985,376 bytes consed
This should create the full histogram in about 2 hours and 45 minutes
on one CPU of this 8 CPU box. I'll start it running while thinking
about making it multithreaded and optimizing the implementation (over
lunch).
While that's running, I need to modify the implementation of
number-word-histogram to build a hashtable of only
certain stems and test-histogram to optionally pass in a
base stem.
(defun number-word-histogram (max-value stem-size &optional (base-stem nil))
"For all integers from 1 to MAX-VALUE, return the number of integers,
the total length of the English words for their values, and their sum
for those whose words are the same up to STEM-SIZE. If BASE-STEM is
specified, the word must begin with it."
(flet ((bad-stem-p (word)
(and base-stem
(or (< (length word) (length base-stem))
(string/= (subseq word 0 (length base-stem)) base-stem))))
(stem-tally (stem histogram)
(multiple-value-bind (tally exists) (gethash stem histogram)
(if exists
tally
(list 0 0 0)))))
(let ((histogram (make-hash-table :test 'equal))
(max-stem-size (+ stem-size (length base-stem))))
(dotimes (i max-value histogram)
(let ((word (stripped-number-word (1+ i))))
(unless (bad-stem-p word)
(let* ((stem-length (min (length word) max-stem-size))
(stem (subseq word 0 stem-length))
(tally (stem-tally stem histogram)))
(setf (gethash stem histogram)
(list (1+ (first tally))
(+ (second tally) (length word))
(+ (third tally) (1+ i)))))))))))
(defun test-histogram (max-value stem-size &optional (base-stem nil))
(let ((totals (list 0 0 0)))
(maphash (lambda (key value)
(setf totals (mapcar #'+ totals value))
(format t "~S ~S~&" key value))
(number-word-histogram max-value stem-size base-stem))
totals))
This still searches the whole range from 1 to 999,999,999 even when a
base stem is specified. To limit the amount of duplicate computation,
I'll add a minimum value and track the minimum and maximum values
associated with each stem.
(defun number-word-histogram (max-value stem-size
&optional (base-stem nil) (min-value 1))
"For all integers from MIN-VALUE to MAX-VALUE, return the number of
integers, the total length of the English words for their values, and
their sum for those whose words are the same up to STEM-SIZE. If
BASE-STEM is specified, the word must begin with it."
(flet ((bad-stem-p (word)
(and base-stem
(or (< (length word) (length base-stem))
(string/= (subseq word 0 (length base-stem)) base-stem))))
(stem-tally (stem histogram)
(multiple-value-bind (tally exists) (gethash stem histogram)
(if exists
tally
(list 0 0 0 0 0)))))
(let ((histogram (make-hash-table :test 'equal))
(max-stem-size (+ stem-size (length base-stem))))
(do* ((i min-value (1+ i))
(word (stripped-number-word i) (stripped-number-word i)))
((> i max-value) histogram)
(unless (bad-stem-p word)
(let* ((stem-length (min (length word) max-stem-size))
(stem (subseq word 0 stem-length))
(tally (stem-tally stem histogram)))
(setf (gethash stem histogram)
(list (1+ (first tally))
(+ (second tally) (length word))
(+ (third tally) i)
(if (zerop (fourth tally)) i (min i (fourth tally)))
(max i (fifth tally))))))))))
(defun test-histogram (max-value stem-size &optional (base-stem nil))
(let ((totals (list 0 0 0 0 0)))
(maphash (lambda (key value)
(setf totals (mapcar #'+ totals value))
(format t "~S ~S~&" key value))
(number-word-histogram max-value stem-size base-stem))
totals))
A couple of quick tests show this is still working:
(test-histogram 10000 4)
"sixt" (1011 29533 6500161 16 6999)
"twot" (1000 29440 2499500 2000 2999)
"six" (1 3 6 6 6)
"ten" (1 3 10 10 10)
"eigh" (1112 33603 8585321 8 8999)
"twel" (1 6 12 12 12)
"twen" (10 96 245 20 29)
"thre" (1101 33499 3534453 3 3999)
"thir" (11 104 358 13 39)
"seve" (1112 33614 7575219 7 7999)
"two" (1 3 2 2 2)
"one" (1 3 1 1 1)
"fift" (11 93 560 15 59)
"oneh" (100 1854 14950 100 199)
"five" (1101 32398 5554455 5 5999)
"fort" (10 86 445 40 49)
"tent" (1 11 10000 10000 10000)
"onet" (1000 29440 1499500 1000 1999)
"elev" (1 6 11 11 11)
"four" (1102 32406 4544468 4 4999)
"sixh" (100 1854 64950 600 699)
"nine" (1112 32502 9595423 9 9999)
"twoh" (100 1854 24950 200 299)
(10000 292411 50005000 14082 65406)
(test-histogram 10000 2 "nine")
"ninete" (1 8 19 19 19)
"nineth" (1000 30440 9499500 9000 9999)
"ninehu" (100 1954 94950 900 999)
"ninety" (10 96 945 90 99)
"nine" (1 4 9 9 9)
(1112 32502 9595423 10018 11125)
I'll start this one running on the big machine.
To find out where the 51 billionth letter resides, I need to sort the
word stems alphabetically and compute a rolling sum of the letter
counts. The first part is straightforward:
(defun sorted-stem-letters (histogram)
"Return a list of word stems and the total number of word letters up
to and including that stem, ordered alphabetically."
(let ((letters (list)))
(maphash (lambda (key value)
(push (list key (second value)) letters))
histogram)
(sort letters #'string< :key #'car)))
(sorted-stem-letters (number-word-histogram 10000 5))
(("eight" 33603) ("eleve" 6) ("fifte" 7) ("fifty" 86) ("five" 4) ("fiveh" 1954) ("fivet" 30440) ("forty" 86) ("four" 4) ("fourh" 1954) ("fourt" 30448) ("nine" 4) ("nineh" 1954) ("ninet" 30544) ("one" 3) ("onehu" 1854) ("oneth" 29440) ("seven" 33614) ("six" 3) ("sixhu" 1854) ("sixte" 7) ("sixth" 29440) ("sixty" 86) ("ten" 3) ("tenth" 11) ("thirt" 104) ("three" 33499) ("twelv" 6) ("twent" 96) ("two" 3) ("twohu" 1854) ("twoth" 29440))
Computing the rolling sums requires adding up all of the letter counts
for the pairs up to and including the current pair, and updating the
current pair with that value. mapcon is the obvious
choice for dealing with lists cdr by cdr. A
moment's thought gives the solution (working the first time, of
course, without any need to build it up piece by piece):
(defun rolling-sum (name-value-list)
"Return a list representing the rolling sums of the values in
NAME-VALUE-LIST. NAME-VALUE-LIST is assumed to be in alphabetical
(or other appropriate) order."
(let ((reversed-list-of-lists
(mapcon #'list (reverse name-value-list))))
(reverse
(mapcar #'list
(mapcar #'caar reversed-list-of-lists)
(mapcar (lambda (list) (apply #'+ list))
(mapcar (lambda (list) (mapcar #'cadr list))
reversed-list-of-lists))))))
(rolling-sum (sorted-stem-letters (number-word-histogram 10000 5)))
(("eight" 33603) ("eleve" 33609) ("fifte" 33616) ("fifty" 33702) ("five" 33706) ("fiveh" 35660) ("fivet" 66100) ("forty" 66186) ("four" 66190) ("fourh" 68144) ("fourt" 98592) ("nine" 98596) ("nineh" 100550) ("ninet" 131094) ("one" 131097) ("onehu" 132951) ("oneth" 162391) ("seven" 196005) ("six" 196008) ("sixhu" 197862) ("sixte" 197869) ("sixth" 227309) ("sixty" 227395) ("ten" 227398) ("tenth" 227409) ("thirt" 227513) ("three" 261012) ("twelv" 261018) ("twent" 261114) ("two" 261117) ("twohu" 262971) ("twoth" 292411))
(No, I do not recommend using this function on large lists.)
To find the stem of a particular letter index, given a rolling sum
list:
(defun stem-for-letter-index (histogram index)
"Return from the HISTOGRAM the word stem associated with the letter INDEX."
(let ((sums (rolling-sum (sorted-stem-letters histogram))))
(first (find-if (lambda (item) (<= index (second item))) sums))))
(stem-for-letter-index (number-word-histogram 10000 5) 100000)
"nineh"
The histogram test running on the big machine has finished. The form
(test-histogram 999999999 12) has partitioned the problem
into 443 stems. stem-for-letter-index shows that the 51
billionth letter is in a word with the stem "sixhundredse". That's
not terribly helpful because that stem includes numbers from 607 to
679,999,999.
This does get me closer to a solution, however. With just a few more
tools, I can build a function to iteratively search through the space
of word numbers, expanding only the one stem of interest in each loop
to keep memory requirements tractable:
(defun unique-stem-p (stem-data)
"Indicate if the STEM-DATA represents exactly one integer."
(equal (fourth stem-data) (fifth stem-data)))
(defun integer-value (stem-data)
"Return the value of the integer in STEM-DATA. STEM-DATA is assumed
to represent a unique integer."
(third stem-data))
(defun min-value (stem-data)
"Return the minimum value of the integer in STEM-DATA."
(fourth stem-data))
(defun max-value (stem-data)
"Return the maximum value of the integer in STEM-DATA."
(fifth stem-data))
(defun sorted-stem-sums (histogram)
"Return a list of word stems and the total sum of integers up to and
including that stem, ordered alphabetically."
(let ((sums (list)))
(maphash (lambda (key value)
(push (list key (third value)) sums))
histogram)
(sort sums #'string< :key #'car)))
(defun sum-of-integers (histogram stem)
"Return the sum of all integers alphabetically less than or equal to
STEM in HISTOGRAM."
(let ((sums (rolling-sum (sorted-stem-sums histogram))))
(second (find-if (lambda (item) (string= (first item) stem)) sums))))
(defun sum-of-letter-lengths (histogram stem)
"Return from the HISTOGRAM the word stem associated with the letter INDEX."
(let ((sums (rolling-sum (sorted-stem-letters histogram))))
(second (find-if (lambda (item) (string= (first item) stem)) sums))))
(defun word-for-letter-index (top-of-range index
&optional (initial-stem-length 10)
(stem-length-increment 10))
"Alphabetically sort the words for the integers from 1 to TOP-OF-RANGE,
concatenate them, and determine the integer for the word that contains
the INDEXth letter. Also return the sum of all integers for the words
less than or equal to the word containing the INDEXth letter."
(do* ((histogram (number-word-histogram top-of-range initial-stem-length)
histogram)
(stem (stem-for-letter-index histogram index)
(stem-for-letter-index histogram index))
(stem-data (gethash stem histogram) (gethash stem histogram)))
((unique-stem-p stem-data)
(values (integer-value stem-data)
(sum-of-integers histogram stem)
stem-data
(sum-of-letter-lengths histogram stem)))
(remhash stem histogram)
(maphash (lambda (key value)
(setf (gethash key histogram) value))
(number-word-histogram (max-value stem-data)
stem-length-increment
stem
(min-value stem-data)))))
word-for-letter-index works by successively narrowing the
search space, replacing the entry in the hash table for the stem
containing the letter index with more specific stems until the stem
containing the letter index refers to only one integer. It returns
the integer for the word containing the letter index, plus the
additional results for the problem and some sanity checking data.
Running this on the big machine results in:
* (word-for-letter-index 999999999 51000000000)
676746575
413540008163475743
(1 77 676746575 676746575 676746575)
51000000000
Answer
The 51 billionth letter is 'e' and completes the spelling of
676,746,575. The sum of all the integers up to and including that
value, sorted alphabetically, is 413,540,008,163,475,743.
Paths Not Taken
This solution has not been optimized in any way. It may be possible
to tune it enough to run in reasonable time on a laptop, although a
significant amount of time is spent in the format function. That's an
effort for another day. The puzzle is also highly parallelizable.
Modifying number-word-histogram to use threads would
dramatically improve even untuned performance on multi-core machines.
Source Code
All source code is in
word-numbers.lisp.
Please contact me by
email if you have any
questions, comments, or suggestions.
www.softwarematters.org