Description

"If the integers from 1 to 999,999,999 are written as words, sorted alphabetically, and concatenated, what is the 51 billionth letter?" To be precise: if the integers from 1 to 999,999,999 are expressed in words (omitting spaces, 'and', and punctuation[1]), and sorted alphabetically so that the first six integers are

  • eight
  • eighteen
  • eighteenmillion
  • eighteenmillioneight
  • eighteenmillioneighteen
  • eighteenmillioneighteenthousand

and the last is

twothousandtwohundredtwo

then reading top to bottom, left to right, the 28th letter completes the spelling of the integer "eighteenmillion".

The 51 billionth letter also completes the spelling of an integer. Which one, and what is the sum of all the integers to that point?

[1] For example, 911,610,034 is written "ninehundredelevenmillionsixhundredtenthousandthirtyfour"; 500,000,000 is written "fivehundredmillion"; 1,709 is written "onethousandsevenhundrednine".

(This puzzle description is © ITA Software.)

Solution

This problem is interesting because it has an ugly, brute force solution but begs for a more considered approach. The brute force solution is to generate all of the words, write them to a file, sort them, and start counting. In keeping with the lack of elegance, that approach should be implemented in Perl. Obviously this technique will require at least 51 gigabytes of storage, based solely on the problem description. Disk space is cheap, but there's got to be a better way.

The 51 gigabyte minimum implied by the problem description prohibits running the brute force approach in memory. To get a rough idea of how much memory is required to hold all of the words, the nine digit number 123,456,789 is expressed in English (without spaces and hyphens) as:

onehundredtwentythreemillionfourhundredfiftysixthousandsevenhundredeightynine

That is 77 characters, which suggests that the full set of words will require the better part of 100 gigabytes.

That string was easily computed using Common Lisp, the first programming language that comes to mind when elegance is required. The function used is:

(defun stripped-number-word (i) "Return the English representation of I with spaces and hyphens removed." (remove-if-not #'alpha-char-p (format nil "~R" i)))

Since storing all of the words is not a reasonable option, the solution will need to trade CPU cycles for space. Fortunately, the problem provides a simple ordering. It isn't necessary to store all the numbers and words, only the total length of the words and the sum of the numbers who's corresponding words appear alphabetically before the one containing the 51 billionth letter. Breaking the search space up into multiple alphabetically ordered pieces will allow the problem to be solved by a repeated divide-and-conquer approach.

First, a little exploratory coding. A histogram of number word stems and the total length for all words with those stems will identify where the 51 billionth letter is hiding. My first cut of a histogram function is:

(defun number-word-histogram (max-value stem-size) "For all integers from 1 to MAX-VALUE, return the number of integers, the total length of the English words for their values, and their sum for those whose words are the same up to STEM-SIZE." (flet ((stem-tally (stem histogram) (multiple-value-bind (tally exists) (gethash stem histogram) (if exists tally (list 0 0 0))))) (let ((histogram (make-hash-table :test 'equal))) (dotimes (i max-value histogram) (let* ((word (stripped-number-word (1+ i))) (stem (subseq word 0 (min (length word) stem-size))) (tally (stem-tally stem histogram))) (setf (gethash stem histogram) (list (1+ (first tally)) (+ (second tally) (length word)) (+ (third tally) (1+ i)))))))))

This function creates a hash table with keys that are word stems and values that are lists of interesting information about the numbers and words associated with that stem. In addition to the total length of all words beginning with the stem, number-word-histogram also tracks the sum of all numbers corresponding to those words and the number of words with that stem. The number of words is useful as a sanity check during testing.

A simple test harness shows this function working correctly:

(defun test-histogram (max-value stem-size) (let ((totals (list 0 0 0))) (maphash (lambda (key value) (setf totals (mapcar #'+ totals value)) (format t "~S ~S~&" key value)) (number-word-histogram max-value stem-size)) totals)) (test-histogram 10000 4) "sixt" (1011 29533 6500161) "twot" (1000 29440 2499500) "six" (1 3 6) "ten" (1 3 10) "eigh" (1112 33603 8585321) "twel" (1 6 12) "twen" (10 96 245) "thre" (1101 33499 3534453) "thir" (11 104 358) "seve" (1112 33614 7575219) "two" (1 3 2) "one" (1 3 1) "fift" (11 93 560) "oneh" (100 1854 14950) "five" (1101 32398 5554455) "fort" (10 86 445) "tent" (1 11 10000) "onet" (1000 29440 1499500) "elev" (1 6 11) "four" (1102 32406 4544468) "sixh" (100 1854 64950) "nine" (1112 32502 9595423) "twoh" (100 1854 24950) (10000 292411 50005000)

Working correctly and performing well are two different things. Timing the performance of number-word-histogram on my laptop gives the following numbers:

? (time (number-word-histogram 10000 4)) (NUMBER-WORD-HISTOGRAM 10000 4) took 553 milliseconds (0.553 seconds) to run with 1 available CPU core. During that period, 445 milliseconds (0.445 seconds) were spent in user mode 16 milliseconds (0.016 seconds) were spent in system mode 32 milliseconds (0.032 seconds) was spent in GC. 5,033,584 bytes of memory allocated. # ? (time (number-word-histogram 100000 5)) (NUMBER-WORD-HISTOGRAM 100000 5) took 6,230 milliseconds (6.230 seconds) to run with 1 available CPU core. During that period, 4,513 milliseconds (4.513 seconds) were spent in user mode 170 milliseconds (0.170 seconds) were spent in system mode 152 milliseconds (0.152 seconds) was spent in GC. 55,418,672 bytes of memory allocated. # ? (time (number-word-histogram 1000000 6)) (NUMBER-WORD-HISTOGRAM 1000000 6) took 76,564 milliseconds (76.564 seconds) to run with 1 available CPU core. During that period, 48,774 milliseconds (48.774 seconds) were spent in user mode 1,879 milliseconds (1.879 seconds) were spent in system mode 2,176 milliseconds (2.176 seconds) was spent in GC. 661,388,344 bytes of memory allocated. #

Extrapolating from the last result suggests that computing the histogram up to 999,999,999 will take anywhere from 14 to 21 hours on my laptop. Fortunately, I have access to a bigger machine:

* (time (number-word-histogram 1000000 6)) Evaluation took: 9.766 seconds of real time 9.760517 seconds of total run time (9.716523 user, 0.043994 system) [ Run times consist of 0.208 seconds GC time, and 9.553 seconds non-GC time. ] 99.95% CPU 24,295,513,567 processor cycles 1,476,985,376 bytes consed

This should create the full histogram in about 2 hours and 45 minutes on one CPU of this 8 CPU box. I'll start it running while thinking about making it multithreaded and optimizing the implementation (over lunch).

While that's running, I need to modify the implementation of number-word-histogram to build a hashtable of only certain stems and test-histogram to optionally pass in a base stem.

(defun number-word-histogram (max-value stem-size &optional (base-stem nil)) "For all integers from 1 to MAX-VALUE, return the number of integers, the total length of the English words for their values, and their sum for those whose words are the same up to STEM-SIZE. If BASE-STEM is specified, the word must begin with it." (flet ((bad-stem-p (word) (and base-stem (or (< (length word) (length base-stem)) (string/= (subseq word 0 (length base-stem)) base-stem)))) (stem-tally (stem histogram) (multiple-value-bind (tally exists) (gethash stem histogram) (if exists tally (list 0 0 0))))) (let ((histogram (make-hash-table :test 'equal)) (max-stem-size (+ stem-size (length base-stem)))) (dotimes (i max-value histogram) (let ((word (stripped-number-word (1+ i)))) (unless (bad-stem-p word) (let* ((stem-length (min (length word) max-stem-size)) (stem (subseq word 0 stem-length)) (tally (stem-tally stem histogram))) (setf (gethash stem histogram) (list (1+ (first tally)) (+ (second tally) (length word)) (+ (third tally) (1+ i))))))))))) (defun test-histogram (max-value stem-size &optional (base-stem nil)) (let ((totals (list 0 0 0))) (maphash (lambda (key value) (setf totals (mapcar #'+ totals value)) (format t "~S ~S~&" key value)) (number-word-histogram max-value stem-size base-stem)) totals))

This still searches the whole range from 1 to 999,999,999 even when a base stem is specified. To limit the amount of duplicate computation, I'll add a minimum value and track the minimum and maximum values associated with each stem.

(defun number-word-histogram (max-value stem-size &optional (base-stem nil) (min-value 1)) "For all integers from MIN-VALUE to MAX-VALUE, return the number of integers, the total length of the English words for their values, and their sum for those whose words are the same up to STEM-SIZE. If BASE-STEM is specified, the word must begin with it." (flet ((bad-stem-p (word) (and base-stem (or (< (length word) (length base-stem)) (string/= (subseq word 0 (length base-stem)) base-stem)))) (stem-tally (stem histogram) (multiple-value-bind (tally exists) (gethash stem histogram) (if exists tally (list 0 0 0 0 0))))) (let ((histogram (make-hash-table :test 'equal)) (max-stem-size (+ stem-size (length base-stem)))) (do* ((i min-value (1+ i)) (word (stripped-number-word i) (stripped-number-word i))) ((> i max-value) histogram) (unless (bad-stem-p word) (let* ((stem-length (min (length word) max-stem-size)) (stem (subseq word 0 stem-length)) (tally (stem-tally stem histogram))) (setf (gethash stem histogram) (list (1+ (first tally)) (+ (second tally) (length word)) (+ (third tally) i) (if (zerop (fourth tally)) i (min i (fourth tally))) (max i (fifth tally)))))))))) (defun test-histogram (max-value stem-size &optional (base-stem nil)) (let ((totals (list 0 0 0 0 0))) (maphash (lambda (key value) (setf totals (mapcar #'+ totals value)) (format t "~S ~S~&" key value)) (number-word-histogram max-value stem-size base-stem)) totals))

A couple of quick tests show this is still working:

(test-histogram 10000 4) "sixt" (1011 29533 6500161 16 6999) "twot" (1000 29440 2499500 2000 2999) "six" (1 3 6 6 6) "ten" (1 3 10 10 10) "eigh" (1112 33603 8585321 8 8999) "twel" (1 6 12 12 12) "twen" (10 96 245 20 29) "thre" (1101 33499 3534453 3 3999) "thir" (11 104 358 13 39) "seve" (1112 33614 7575219 7 7999) "two" (1 3 2 2 2) "one" (1 3 1 1 1) "fift" (11 93 560 15 59) "oneh" (100 1854 14950 100 199) "five" (1101 32398 5554455 5 5999) "fort" (10 86 445 40 49) "tent" (1 11 10000 10000 10000) "onet" (1000 29440 1499500 1000 1999) "elev" (1 6 11 11 11) "four" (1102 32406 4544468 4 4999) "sixh" (100 1854 64950 600 699) "nine" (1112 32502 9595423 9 9999) "twoh" (100 1854 24950 200 299) (10000 292411 50005000 14082 65406) (test-histogram 10000 2 "nine") "ninete" (1 8 19 19 19) "nineth" (1000 30440 9499500 9000 9999) "ninehu" (100 1954 94950 900 999) "ninety" (10 96 945 90 99) "nine" (1 4 9 9 9) (1112 32502 9595423 10018 11125)

I'll start this one running on the big machine.

To find out where the 51 billionth letter resides, I need to sort the word stems alphabetically and compute a rolling sum of the letter counts. The first part is straightforward:

(defun sorted-stem-letters (histogram) "Return a list of word stems and the total number of word letters up to and including that stem, ordered alphabetically." (let ((letters (list))) (maphash (lambda (key value) (push (list key (second value)) letters)) histogram) (sort letters #'string< :key #'car))) (sorted-stem-letters (number-word-histogram 10000 5)) (("eight" 33603) ("eleve" 6) ("fifte" 7) ("fifty" 86) ("five" 4) ("fiveh" 1954) ("fivet" 30440) ("forty" 86) ("four" 4) ("fourh" 1954) ("fourt" 30448) ("nine" 4) ("nineh" 1954) ("ninet" 30544) ("one" 3) ("onehu" 1854) ("oneth" 29440) ("seven" 33614) ("six" 3) ("sixhu" 1854) ("sixte" 7) ("sixth" 29440) ("sixty" 86) ("ten" 3) ("tenth" 11) ("thirt" 104) ("three" 33499) ("twelv" 6) ("twent" 96) ("two" 3) ("twohu" 1854) ("twoth" 29440))

Computing the rolling sums requires adding up all of the letter counts for the pairs up to and including the current pair, and updating the current pair with that value. mapcon is the obvious choice for dealing with lists cdr by cdr. A moment's thought gives the solution (working the first time, of course, without any need to build it up piece by piece):

(defun rolling-sum (name-value-list) "Return a list representing the rolling sums of the values in NAME-VALUE-LIST. NAME-VALUE-LIST is assumed to be in alphabetical (or other appropriate) order." (let ((reversed-list-of-lists (mapcon #'list (reverse name-value-list)))) (reverse (mapcar #'list (mapcar #'caar reversed-list-of-lists) (mapcar (lambda (list) (apply #'+ list)) (mapcar (lambda (list) (mapcar #'cadr list)) reversed-list-of-lists)))))) (rolling-sum (sorted-stem-letters (number-word-histogram 10000 5))) (("eight" 33603) ("eleve" 33609) ("fifte" 33616) ("fifty" 33702) ("five" 33706) ("fiveh" 35660) ("fivet" 66100) ("forty" 66186) ("four" 66190) ("fourh" 68144) ("fourt" 98592) ("nine" 98596) ("nineh" 100550) ("ninet" 131094) ("one" 131097) ("onehu" 132951) ("oneth" 162391) ("seven" 196005) ("six" 196008) ("sixhu" 197862) ("sixte" 197869) ("sixth" 227309) ("sixty" 227395) ("ten" 227398) ("tenth" 227409) ("thirt" 227513) ("three" 261012) ("twelv" 261018) ("twent" 261114) ("two" 261117) ("twohu" 262971) ("twoth" 292411))

(No, I do not recommend using this function on large lists.)

To find the stem of a particular letter index, given a rolling sum list:

(defun stem-for-letter-index (histogram index) "Return from the HISTOGRAM the word stem associated with the letter INDEX." (let ((sums (rolling-sum (sorted-stem-letters histogram)))) (first (find-if (lambda (item) (<= index (second item))) sums)))) (stem-for-letter-index (number-word-histogram 10000 5) 100000) "nineh"

The histogram test running on the big machine has finished. The form (test-histogram 999999999 12) has partitioned the problem into 443 stems. stem-for-letter-index shows that the 51 billionth letter is in a word with the stem "sixhundredse". That's not terribly helpful because that stem includes numbers from 607 to 679,999,999.

This does get me closer to a solution, however. With just a few more tools, I can build a function to iteratively search through the space of word numbers, expanding only the one stem of interest in each loop to keep memory requirements tractable:

(defun unique-stem-p (stem-data) "Indicate if the STEM-DATA represents exactly one integer." (equal (fourth stem-data) (fifth stem-data))) (defun integer-value (stem-data) "Return the value of the integer in STEM-DATA. STEM-DATA is assumed to represent a unique integer." (third stem-data)) (defun min-value (stem-data) "Return the minimum value of the integer in STEM-DATA." (fourth stem-data)) (defun max-value (stem-data) "Return the maximum value of the integer in STEM-DATA." (fifth stem-data)) (defun sorted-stem-sums (histogram) "Return a list of word stems and the total sum of integers up to and including that stem, ordered alphabetically." (let ((sums (list))) (maphash (lambda (key value) (push (list key (third value)) sums)) histogram) (sort sums #'string< :key #'car))) (defun sum-of-integers (histogram stem) "Return the sum of all integers alphabetically less than or equal to STEM in HISTOGRAM." (let ((sums (rolling-sum (sorted-stem-sums histogram)))) (second (find-if (lambda (item) (string= (first item) stem)) sums)))) (defun sum-of-letter-lengths (histogram stem) "Return from the HISTOGRAM the word stem associated with the letter INDEX." (let ((sums (rolling-sum (sorted-stem-letters histogram)))) (second (find-if (lambda (item) (string= (first item) stem)) sums)))) (defun word-for-letter-index (top-of-range index &optional (initial-stem-length 10) (stem-length-increment 10)) "Alphabetically sort the words for the integers from 1 to TOP-OF-RANGE, concatenate them, and determine the integer for the word that contains the INDEXth letter. Also return the sum of all integers for the words less than or equal to the word containing the INDEXth letter." (do* ((histogram (number-word-histogram top-of-range initial-stem-length) histogram) (stem (stem-for-letter-index histogram index) (stem-for-letter-index histogram index)) (stem-data (gethash stem histogram) (gethash stem histogram))) ((unique-stem-p stem-data) (values (integer-value stem-data) (sum-of-integers histogram stem) stem-data (sum-of-letter-lengths histogram stem))) (remhash stem histogram) (maphash (lambda (key value) (setf (gethash key histogram) value)) (number-word-histogram (max-value stem-data) stem-length-increment stem (min-value stem-data)))))

word-for-letter-index works by successively narrowing the search space, replacing the entry in the hash table for the stem containing the letter index with more specific stems until the stem containing the letter index refers to only one integer. It returns the integer for the word containing the letter index, plus the additional results for the problem and some sanity checking data. Running this on the big machine results in:

* (word-for-letter-index 999999999 51000000000) 676746575 413540008163475743 (1 77 676746575 676746575 676746575) 51000000000

Answer

The 51 billionth letter is 'e' and completes the spelling of 676,746,575. The sum of all the integers up to and including that value, sorted alphabetically, is 413,540,008,163,475,743.

Paths Not Taken

This solution has not been optimized in any way. It may be possible to tune it enough to run in reasonable time on a laptop, although a significant amount of time is spent in the format function. That's an effort for another day. The puzzle is also highly parallelizable. Modifying number-word-histogram to use threads would dramatically improve even untuned performance on multi-core machines.

Source Code

All source code is in word-numbers.lisp.

Please contact me by email if you have any questions, comments, or suggestions.

www.softwarematters.org