Description

"How long a chain of overlapping movie titles, like Sling Blade Runner, can you find?"

Use the following listing of movie titles: MOVIES.LST. Multi-word overlaps, as in "License to Kill a Mockingbird," are allowed. The same title may not be used more than once in a solution. Heuristic solutions that may not always produce the greatest number of titles will be accepted: seek a reasonable tradeoff of efficiency and optimality.

Data provided by MovieLens at the University of Minnesota.

(This puzzle description is © ITA Software.)

Solution

After seeking elegant solutions to the Word Numbers and BitVector Genealogy puzzles, I'm in the mood to brute force this one. I'll use C and demonstrate some bit twiddling and pointer bashing. First, I need to define the data structures. MOVIES.LST is only about 120kb, so I'll load the whole file into memory and treat it as an array.

/* * Load the entire contents of the specified file into memory. The * caller is responsible for freeing the returned memory. */ char *load_file(char *filename) { char *contents = NULL; long length = 0; FILE *file = fopen(filename,"r"); if (file) { fseek(file,0L,SEEK_END); length = ftell(file); rewind(file); contents = malloc((length + 1) * sizeof(char)); fread(contents,length,1,file); contents[length] = '\0'; } else fprintf(stderr,"Unable to open file %s.\n",filename); return contents; } /* Count the number of occurrences of a character in a string. */ long count_character(char *string,char character) { long count = 0; while (*string) if (*string++ == character) ++count; return count; } /* * Convert a list of '\n' separated strings into an array of '\0' * terminated strings. The passed array is modified. The caller is * responsible for freeing the memory for the returned array. */ char **lines_to_array(char *lines) { char **string_array = NULL; long string_count = count_character(lines,'\n'); long index = 0; string_array = malloc(string_count * sizeof(char *)); string_array[0] = lines; while (*lines) { if (*lines == '\n') { *lines = '\0'; if (++index < string_count) string_array[index] = lines + 1; } ++lines; } return string_array; } /* Solve the puzzle. */ int main(int argc,char *argv[]) { char *lines = NULL; char **titles = NULL; long count = 0; if (argc == 2) if (lines = load_file(argv[1])) { titles = lines_to_array(lines); while (*titles++) ++count; fprintf(stdout,"Found %d titles in %s.\n",count,argv[1]); } else fprintf(stderr,"No words found in %s.\n",argv[1]); else fprintf(stderr,"Usage: %s \n",argv[0]); }

Working in other languages has made some of my C non-idiomatic. I can always go back and make the identifiers less descriptive later.

gcc movie.c followed by ./a.out MOVIES.LST gives a result of "Found 6562 titles in MOVIES.LST." which is in agreement with wc -l.

Now that I have the titles in memory, I can think about what data structure is going to make it easiest to simply read the longest combination of titles. Each title has a set of head strings and a set of tail strings. The heads are all word sequences that do not include the last word and the tails are all word sequences that do not include the first. For "A Hard Days Night" the set of head strings is ("A", "A Hard", "A Hard Days") and the set of tail strings is ("Hard Days Night", "Days Night", "Night"). To solve the puzzle, I need to overlap as many heads and tails of distinct titles as possible.

One approach is to create a structure containing each substring, a list of the titles for which it is in the set of heads, and a list of the titles for which it is in the set of tails, e.g.:

typedef struct title_substring { char *substring; long *title_heads; long *title_tails; } title_substring;

Combined with another structure that holds each title and points to all related title_substring structures, this allows a directed acyclic graph to be created after the titles are parsed. The deepest branch on that graph represents the longest combination of titles.

That's an interesting design, but not brute force enough for my mood. I'll start by putting the first title into a pool, then I'll add the other titles one at a time, plus all possible combinations that can be made with the new title and the existing pool. I will have to keep track of which titles are used in each member of the population, but when it's done, I'll have created all possible combinations and can just pick the longest one.

It's important to note that brute force doesn't require that the implementation be particularly stupid. The number of string comparisons could grow geometrically if this were done naively. I will still use a variant of the title_substring structure to avoid having to look at each and every member of the population when a new title is added. I'll start by defining the data structures:

/* A singly linked list to hold title indices. */ typedef struct title_index_node { long index; struct title_index_node *next; } title_index_node; typedef struct title_index_list { title_index_node *head; title_index_node *tail; } title_index_list; /* * A structure to hold an aggregate title (a title made up of one or * more other titles). */ typedef struct aggregate_title { char *title; title_index_list *component_title_indices; } aggregate_title; /* A singly linked list to hold aggregate titles. */ typedef struct aggregate_title_node { aggregate_title *title; struct aggregate_title_node *next; } aggregate_title_node; typedef struct aggregate_title_list { aggregate_title_node *head; aggregate_title_node *tail; } aggregate_title_list;

I'll need some utility functions to operate on these structures:

/* * Create a new title_index_node. The caller is responsible for freeing * the returned memory. */ title_index_node *new_title_index_node(long index) { title_index_node *node = (title_index_node *)malloc(sizeof(title_index_node)); node->index = index; node->next = NULL; return node; } /* * Create a new title_index_list. The caller is responsible for freeing * the returned memory. */ title_index_list *new_title_index_list() { title_index_list *list = (title_index_list *)malloc(sizeof(title_index_list)); list->head = list->tail = NULL; return list; } /* Add an index to a list. */ void add_index(title_index_list *list,long index) { title_index_node *node = new_title_index_node(index); if (list->head == NULL) list->head = list->tail = node; else list->tail = list->tail->next = node; } /* * Create a new aggregate_title. The caller is responsible for * freeing the returned memory. */ aggregate_title *new_aggregate_title(char *title,long index) { aggregate_title *aggregate = (aggregate_title *)malloc(sizeof(aggregate_title)); aggregate->title = title; aggregate->component_title_indices = new_title_index_list(); add_index(aggregate->component_title_indices,index); return aggregate; } /* * Create a new aggregate_title_node. The caller is responsible for * freeing the returned memory. */ aggregate_title_node *new_aggregate_title_node(aggregate_title *title) { aggregate_title_node *node = (aggregate_title_node *)malloc(sizeof(aggregate_title_node)); node->title = title; node->next = NULL; return node; } /* * Create a new aggregate_title_list. The caller is responsible for * freeing the returned memory. */ aggregate_title_list *new_aggregate_title_list() { aggregate_title_list *list = (aggregate_title_list *)malloc(sizeof(aggregate_title_list)); list->head = list->tail = NULL; return list; } /* Add an aggregate_title to a list. */ void add_aggregate_title(aggregate_title_list *list,aggregate_title *title) { aggregate_title_node *node = new_aggregate_title_node(title); if (list->head == NULL) list->head = list->tail = node; else list->tail = list->tail->next = node; }

I'll test this by building a population consisting of only the original movie titles.

/* * Create a population of all aggregate titles that can be generated * from the specified list of titles. The caller is responsible for * freeing the memory for the returned array. */ aggregate_title_list *create_title_population(char **titles) { aggregate_title_list *population = new_aggregate_title_list(); long count = 0; /* first cut: no aggregation */ while (*titles) add_aggregate_title(population,new_aggregate_title(*titles++,count++)); return population; }

A couple of functions to find the longest aggregate title in the population, some changes to main and I'm ready for the first end-to-end test. To make testing easier, I'll use a test data file consisting only of titles that include the words "sling", "blade", and "runner".

/* Return the number of indices in the specified aggregate_title. */ int title_index_count(aggregate_title *aggregate) { title_index_list *list = aggregate->component_title_indices; title_index_node *node = list->head; int count = 0; while (node) { ++count; node = node->next; } return count; } /* Find the aggregate title composed of the most titles. */ aggregate_title *find_longest_title(aggregate_title_list *list) { aggregate_title *longest = NULL; aggregate_title_node *node = list->head; int max_index_count = 0; int index_count; while (node) { if ((index_count = title_index_count(node->title)) > max_index_count) { longest = node->title; max_index_count = index_count; } node = node->next; } return longest; } /* Solve the puzzle. */ int main(int argc,char *argv[]) { char **titles = NULL; aggregate_title_list *population = NULL; aggregate_title *longest_title = NULL; title_index_node *title_index = NULL; if (argc == 2) { titles = load_titles(argv[1]); population = create_title_population(titles); longest_title = find_longest_title(population); if (longest_title) { fprintf(stdout,"Longest aggregate title: %s\n",longest_title->title); title_index = longest_title->component_title_indices->head; while (title_index) { fprintf(stdout,"\t%s\n",titles[title_index->index]); title_index = title_index->next; } } else fprintf(stderr,"Null aggregate title found.\n"); } else fprintf(stderr,"Usage: %s \n",argv[0]); }

A few quick tests and some poking around in the debugger shows that everything is being loaded correctly. Now it's time to flesh out the create_title_population function to do something more interesting. First I'll need to create some data structures to maintain the heads and tails of each title:

/* A structure to hold the lists of titles associated with a substring. */ typedef struct title_substring { char *substring; aggregate_title_list *heads; aggregate_title_list *tails; } title_substring; /* A singly linked list to hold title substrings. */ typedef struct title_substring_node { title_substring *substring; struct title_substring_node *next; } title_substring_node; typedef struct title_substring_list { title_substring_node *head; title_substring_node *tail; } title_substring_list;

And, of course, the utility functions around these structures (I just love C):

/* * Create a new title_substring. The caller is responsible for freeing * the returned memory. */ title_substring *new_title_substring(char *string) { title_substring *substring = (title_substring *)malloc(sizeof(title_substring)); substring->substring = string; substring->heads = new_aggregate_title_list(); substring->tails = new_aggregate_title_list(); return substring; } /* Flag a title index as having a substring as its head. */ void add_head_title(title_substring *substring,aggregate_title *title) { add_aggregate_title(substring->heads,title); } /* Flag a title index as having a substring as its tail. */ void add_tail_title(title_substring *substring,aggregate_title *title) { add_aggregate_title(substring->tails,title); } /* * Create a new title_substring_node. The caller is responsible for * freeing the returned memory. */ title_substring_node *new_title_substring_node(title_substring *substring) { title_substring_node *node = (title_substring_node *)malloc(sizeof(title_substring_node)); node->substring = substring; node->next = NULL; return node; } /* * Create a new title_substring_list. The caller is responsible for * freeing the returned memory. */ title_substring_list *new_title_substring_list() { title_substring_list *list = (title_substring_list *)malloc(sizeof(title_substring_list)); list->head = list->tail = NULL; return list; }

If I were going to have any more linked lists, I'd use a library for them. This data structure, in particular, may have to change to some sort of tree if the time spent searching the substrings is significant, so I'll leave it hand coded for now.

Finding the most overlapping title heads and tails requires first identifying the heads and tails for each title:

/* * Return a list of all substrings consisting of full words up to but * not including the final word of the specified title. The caller is * responsible for freeing the returned memory. */ char **head_substrings(char *title) { int substring_count = count_character(title,' '); char **head_strings = (char **)malloc((substring_count + 1) * sizeof(char *)); char *working_title = strdup(title); int i; head_strings[substring_count] = '\0'; for (i = 0;i < substring_count;i++) { *strrchr(working_title,' ') = '\0'; head_strings[i] = strdup(working_title); } free(working_title); return head_strings; } /* * Return a list of all substrings consisting of full words starting * with but not including the first word of the specified title. The * caller is responsible for freeing the returned memory. */ char **tail_substrings(char *title) { int substring_count = count_character(title,' '); char **tail_strings = (char **)malloc((substring_count + 1) * sizeof(char *)); char *last_tail = title; int i; tail_strings[substring_count] = '\0'; for (i = 0;i < substring_count;i++) last_tail = tail_strings[i] = strchr(last_tail,' ') + 1; return tail_strings; }

I've been trying to minimize memory allocation by reusing string pointers wherever possible, but creating the head substrings requires some additional memory. Testing this showed that some of the titles in MOVIES.LST have a trailing space. I edited the file to remove blank lines, eliminate trailing whitespace, and ensure that the final title is terminated with a newline.

An important point to note about this implementation is that neither the set of head substrings nor the set of tail substrings includes the title string. A title that fully overlaps with another isn't particularly interesting.

Now I can modify create_title_population to check for overlapping heads and tails. If an overlap is found, a new aggregate title is created and added to the population.

While implementing this, I discovered (code speaks) that I could minimize the number of searches through the list of title substrings by adding pointers from each aggregate_title to the head and tail substrings for that title. To prevent memory leaks from all this sharing, I'll adopt the convention that all aggregate_title_nodes are owned by the population aggregate_title_list and all title_substring_nodes are owned by the title_substrings title_substring_list. When one list is created from or added to another, only shallow copies of the nodes will be performed. This requires a few tweaks to the data structures:

/* * A structure to hold an aggregate title (a title made up of one or * more other titles). */ typedef struct title_substring_list title_substring_list; typedef struct aggregate_title { char *title; title_index_list *component_title_indices; title_substring_list *heads; title_substring_list *tails; } aggregate_title;

I'll start by finding all the titles in the population that have head substrings that overlap with any of the tail substrings of the title being added:

/* * Return a title_substring_list containing a node for each substring * passed in. If a title_substring_node does not exist in the specified * list, create one. This function modifies the passed list. The * caller is responsible for freeing the memory for the additional nodes * and for the returned list. */ title_substring_list *title_substrings(title_substring_list *starting_list, char **substrings) { title_substring_list *substring_list = new_title_substring_list(); char **substring_ptr = substrings; title_substring_node *node; title_substring *title_substring = NULL; while (*substring_ptr) { if (node = find_substring(starting_list,*substring_ptr)) add_title_substring_node(substring_list, new_title_substring_node(node->substring)); else { title_substring = new_title_substring(*substring_ptr); add_title_substring_node(starting_list, new_title_substring_node(title_substring)); add_title_substring_node(substring_list, new_title_substring_node(title_substring)); } *substring_ptr++; } return substring_list; } /* * Find all titles that overlap with the tails of the specified * aggregate_title, create new aggregate_titles from them, and add them * to the specified aggregate_title_list. */ void find_tail_overlaps(aggregate_title *title,aggregate_title_list *list) { title_substring_node *substring_node = NULL; aggregate_title_node *title_node = NULL; aggregate_title *aggregate_title = NULL; substring_node = title->tails->head; while (substring_node) { title_node = substring_node->substring->heads->head; while (title_node) { if (aggregate_title = combine_titles(title,title_node->title)) add_aggregate_title(list,aggregate_title); title_node = title_node->next; } substring_node = substring_node->next; } } /* * Create a population of all aggregate titles that can be generated * from the specified list of titles. The caller is responsible for * freeing the memory for the returned array. */ aggregate_title_list *create_title_population(char **titles) { aggregate_title_list *population = new_aggregate_title_list(); title_substring_list *title_subtitles = new_title_substring_list(); aggregate_title *title = NULL; long count = 0; while (*titles) { fprintf(stdout, "create_title_population(): processing title %s\n", *titles); title = new_base_title(*titles++,count++); set_tail_substrings(title, title_substrings(title_subtitles, tail_substrings(title->title))); find_tail_overlaps(title,population); add_aggregate_title(population,title); } return population; }

These require several supporting functions:

/* * Return the title_substring_node in the specified list containing the * specified string, or null if it isn't found. The substrings in the * list are assumed to be in alphabetical order. */ title_substring_node *find_substring(title_substring_list *list,char *string) { title_substring_node *node = list->head; title_substring_node *match = NULL; while (node && !match) { if (strcmp(node->substring->substring,string) == 0) match = node; else if (strcmp(node->substring->substring,string) < 0) node = node->next; else node = NULL; } return match; } /* Return 0 if the two titles have no components in common. */ int shared_titles(aggregate_title *head,aggregate_title *tail) { int shared = 0; title_index_node *head_index_node = head->component_title_indices->head; title_index_node *tail_index_node = NULL; while (head_index_node && !shared) { tail_index_node = tail->component_title_indices->head; while (tail_index_node && !shared) { if (tail_index_node->index == head_index_node->index) ++shared; tail_index_node = tail_index_node->next; } head_index_node = head_index_node->next; } return shared; } /* Return the longest overlapping substrings for the two titles. */ char *longest_overlap(aggregate_title *head,aggregate_title *tail) { title_substring_node *head_node = head->tails->head; title_substring_node *tail_node = NULL; char *overlap = NULL; while (head_node) { tail_node = tail->heads->head; while (tail_node) { if ((strcmp(head_node->substring->substring, tail_node->substring->substring) == 0) && (!overlap || (strlen(overlap) < strlen(head_node->substring->substring)))) overlap = head_node->substring->substring; tail_node = tail_node->next; } head_node = head_node->next; } return overlap; } /* Combine two aggregate titles. */ aggregate_title *combine_titles(aggregate_title *head,aggregate_title *tail) { aggregate_title *title = NULL; char *overlap = NULL; char *new_string = NULL; if (!shared_titles(head,tail)) { overlap = longest_overlap(head,tail); new_string = (char *)malloc((strlen(head->title) + strlen(tail->title) - strlen(overlap)) * sizeof(char *)); sprintf(new_string,"%s %s",head->title,tail->title + strlen(overlap) + 1); title = new_aggregate_title(new_string); set_head_substrings(title,head->heads); set_tail_substrings(title,tail->tails); add_index_list(title->component_title_indices, head->component_title_indices); add_index_list(title->component_title_indices, tail->component_title_indices); } return title; }

The combine_titles function does a lot of the heavy lifting. It creates new aggregate titles, sets the head and tail substrings appropriately, and updates the back pointers from those substrings.

Finding head overlaps is very similar:

/* * Find all titles that overlap with the heads of the specified * aggregate_title, create new aggregate_titles from them, and add them * to the specified aggregate_title_list. */ void find_head_overlaps(aggregate_title *title,aggregate_title_list *list) { title_substring_node *substring_node = NULL; aggregate_title_node *title_node = NULL; aggregate_title *aggregate_title = NULL; substring_node = title->heads->head; while (substring_node) { title_node = substring_node->substring->tails->head; while (title_node) { if (aggregate_title = combine_titles(title_node->title,title)) add_aggregate_title(list,aggregate_title); title_node = title_node->next; } substring_node = substring_node->next; } } /* * Create a population of all aggregate titles that can be generated * from the specified list of titles. The caller is responsible for * freeing the memory for the returned array. */ aggregate_title_list *create_title_population(char **titles) { aggregate_title_list *population = new_aggregate_title_list(); title_substring_list *title_subtitles = new_title_substring_list(); aggregate_title *title = NULL; long count = 0; while (*titles) { fprintf(stdout, "create_title_population(): processing title %s\n", *titles); title = new_base_title(*titles++,count++); set_head_substrings(title, title_substrings(title_subtitles, head_substrings(title->title))); set_tail_substrings(title, title_substrings(title_subtitles, tail_substrings(title->title))); find_head_overlaps(title,population); find_tail_overlaps(title,population); add_aggregate_title(population,title); } return population; }

After create_title_population completes, the solution can be read from the population aggregate title list:

/* Return the number of indices in the specified aggregate_title. */ int title_index_count(aggregate_title *aggregate) { title_index_list *list = aggregate->component_title_indices; title_index_node *node = list->head; int count = 0; while (node) { ++count; node = node->next; } return count; } /* Write an aggregate title and its components to stdout. */ void dump_aggregate_title(aggregate_title *title,char **base_titles) { title_index_node *title_index = NULL; fprintf(stdout,"Aggregate title: %s\n",title->title); title_index = title->component_title_indices->head; while (title_index) { fprintf(stdout,"\t%s\n",base_titles[title_index->index]); title_index = title_index->next; } } /* Find the aggregate title composed of the most titles. */ aggregate_title *find_longest_title(aggregate_title_list *list,char **titles) { aggregate_title *longest = NULL; aggregate_title_node *node = list->head; int max_index_count = 0; int index_count; while (node) { if ((index_count = title_index_count(node->title)) > max_index_count) { longest = node->title; max_index_count = index_count; dump_aggregate_title(node->title,titles); } else if (index_count == max_index_count) dump_aggregate_title(node->title,titles); node = node->next; } return longest; } /* Solve the puzzle. */ int main(int argc,char *argv[]) { char **titles = NULL; aggregate_title_list *population = NULL; aggregate_title *longest_title = NULL; if (argc == 2) { titles = load_titles(argv[1]); population = create_title_population(titles); longest_title = find_longest_title(population,titles); if (longest_title) dump_aggregate_title(longest_title,titles); else fprintf(stderr,"Null aggregate title found.\n"); } else fprintf(stderr,"Usage: %s \n",argv[0]); }

Answer

This code found eight aggregate titles composed of twelve base titles each. My personal favorite is Who's The Man From Snowy River Of No Return To Me Myself I Went Down To Earth Girls Are Easy Money For Nothing But Trouble Bound, made up of:

All source is available in the file movie.c.

Paths Not Taken

In retrospect, I suspect that building a directed acyclic graph instead of a population might have been easier. If this were production code, I would place each cohesive set of structures and utility functions into separate source files and create a makefile to build the executable. I would also use a generic library for my linked lists and would run tools like Valgrind or Purify on the code.

That was fun. I'm going back to Lisp now....

Source Code

Please contact me by email if you have any questions, comments, or suggestions.

www.softwarematters.org