## Description

"If the integers from 1 to 999,999,999 are written as words, sorted alphabetically, and concatenated, what is the 51 billionth letter?" To be precise: if the integers from 1 to 999,999,999 are expressed in words (omitting spaces, 'and', and punctuation[1]), and sorted alphabetically so that the first six integers are

- eight
- eighteen
- eighteenmillion
- eighteenmillioneight
- eighteenmillioneighteen
- eighteenmillioneighteenthousand

and the last is

twothousandtwohundredtwo

then reading top to bottom, left to right, the 28th letter completes the spelling of the integer "eighteenmillion".

The 51 billionth letter also completes the spelling of an integer. Which one, and what is the sum of all the integers to that point?

[1] For example, 911,610,034 is written "ninehundredelevenmillionsixhundredtenthousandthirtyfour"; 500,000,000 is written "fivehundredmillion"; 1,709 is written "onethousandsevenhundrednine".

(This puzzle description is © ITA Software.)

## Solution

This problem is interesting because it has an ugly, brute force solution but begs for a more considered approach. The brute force solution is to generate all of the words, write them to a file, sort them, and start counting. In keeping with the lack of elegance, that approach should be implemented in Perl. Obviously this technique will require at least 51 gigabytes of storage, based solely on the problem description. Disk space is cheap, but there's got to be a better way.

The 51 gigabyte minimum implied by the problem description prohibits running the brute force approach in memory. To get a rough idea of how much memory is required to hold all of the words, the nine digit number 123,456,789 is expressed in English (without spaces and hyphens) as:

onehundredtwentythreemillionfourhundredfiftysixthousandsevenhundredeightynine

That is 77 characters, which suggests that the full set of words will require the better part of 100 gigabytes.

That string was easily computed using Common Lisp, the first programming language that comes to mind when elegance is required. The function used is:

Since storing all of the words is not a reasonable option, the solution will need to trade CPU cycles for space. Fortunately, the problem provides a simple ordering. It isn't necessary to store all the numbers and words, only the total length of the words and the sum of the numbers who's corresponding words appear alphabetically before the one containing the 51 billionth letter. Breaking the search space up into multiple alphabetically ordered pieces will allow the problem to be solved by a repeated divide-and-conquer approach.

First, a little exploratory coding. A histogram of number word stems and the total length for all words with those stems will identify where the 51 billionth letter is hiding. My first cut of a histogram function is:

This function creates a hash table with keys that are word stems and
values that are lists of interesting information about the numbers and
words associated with that stem. In addition to the total length of
all words beginning with the stem, `number-word-histogram`

also tracks the sum of all numbers corresponding to those words and
the number of words with that stem. The number of words is useful as
a sanity check during testing.

A simple test harness shows this function working correctly:

Working correctly and performing well are two different things.
Timing the performance of `number-word-histogram`

on my
laptop gives the following numbers:

Extrapolating from the last result suggests that computing the histogram up to 999,999,999 will take anywhere from 14 to 21 hours on my laptop. Fortunately, I have access to a bigger machine:

This should create the full histogram in about 2 hours and 45 minutes on one CPU of this 8 CPU box. I'll start it running while thinking about making it multithreaded and optimizing the implementation (over lunch).

While that's running, I need to modify the implementation of
`number-word-histogram`

to build a hashtable of only
certain stems and `test-histogram`

to optionally pass in a
base stem.

This still searches the whole range from 1 to 999,999,999 even when a base stem is specified. To limit the amount of duplicate computation, I'll add a minimum value and track the minimum and maximum values associated with each stem.

A couple of quick tests show this is still working:

I'll start this one running on the big machine.

To find out where the 51 billionth letter resides, I need to sort the word stems alphabetically and compute a rolling sum of the letter counts. The first part is straightforward:

Computing the rolling sums requires adding up all of the letter counts
for the pairs up to and including the current pair, and updating the
current pair with that value. `mapcon`

is the obvious
choice for dealing with lists `cdr`

by `cdr`

. A
moment's thought gives the solution (working the first time, of
course, without any need to build it up piece by piece):

(No, I do not recommend using this function on large lists.)

To find the stem of a particular letter index, given a rolling sum list:

The histogram test running on the big machine has finished. The form
`(test-histogram 999999999 12)`

has partitioned the problem
into 443 stems. `stem-for-letter-index`

shows that the 51
billionth letter is in a word with the stem "sixhundredse". That's
not terribly helpful because that stem includes numbers from 607 to
679,999,999.

This does get me closer to a solution, however. With just a few more tools, I can build a function to iteratively search through the space of word numbers, expanding only the one stem of interest in each loop to keep memory requirements tractable:

`word-for-letter-index`

works by successively narrowing the
search space, replacing the entry in the hash table for the stem
containing the letter index with more specific stems until the stem
containing the letter index refers to only one integer. It returns
the integer for the word containing the letter index, plus the
additional results for the problem and some sanity checking data.
Running this on the big machine results in:

## Answer

The 51 billionth letter is 'e' and completes the spelling of 676,746,575. The sum of all the integers up to and including that value, sorted alphabetically, is 413,540,008,163,475,743.

## Paths Not Taken

This solution has not been optimized in any way. It may be possible
to tune it enough to run in reasonable time on a laptop, although a
significant amount of time is spent in the format function. That's an
effort for another day. The puzzle is also highly parallelizable.
Modifying `number-word-histogram`

to use threads would
dramatically improve even untuned performance on multi-core machines.

## Source Code

All source code is in word-numbers.lisp.

Please contact me by email if you have any questions, comments, or suggestions.