Generating an English word list from Google ngrams
I wrote a program that generates passphrases. Passphrases are a better alternative to commonly-used, hard-to-remember but easy-to-guess passwords ('eyebrow favor advancing homeland' versus 'Tr0ub4dor&3'). The Python program randomly picks four common English words to make the passphrase. A key consideration is the list of words the program chooses from: it must be long enough that the passphrases it outputs are hard to guess, but not so long that the words become too uncommon to remember. Perhaps the most interesting part of this project was creating a suitable word list.
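The phrase-generation step itself is simple. In sketch form (the file name and the fixed count of four words are stand-ins, not the exact code), it boils down to something like this:

```python
import secrets

# Read the word list, one word per line (wordlist.txt is an example name).
with open("wordlist.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

# Pick four words using a cryptographically secure random choice.
passphrase = " ".join(secrets.choice(words) for _ in range(4))
print(passphrase)  # e.g. "eyebrow favor advancing homeland"
```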
I began as I do any project, by searching the internet to see what people have done and what's already available. Alas, nothing had all of the characteristics I wanted. Most of the existing lists of common English words were 1,000 words long or less, but I wanted at least 10,000 words so that my passphrases would be harder to guess. So I decided I would have to compile my own list.
Google has for some time been scanning any book they can get their hands on. They also perform optical character recognition on these books so that Google users can search through their text. Google Books Ngrams provides a web application that lets you track the number of appearances of words in books against the publication dates of those books. This can be a fun toy and perhaps a good research tool, but the true gold mine is the link at the bottom of the page that proclaims, "Run your own experiment! Raw data is available for download here." There are over 8 million books in this data set, so the files you can download contain a vast amount of information about words and how they are used over time. Each line in a file has a word or combination of words, the publication year being considered, and the number of times that word occurred in the books Google scanned from that year. The sets of words are called ngrams, where N can be any number. A 1-gram is simply a word. A 2-gram is a combination of two words, like 'analysis is'. 'Analysis is often' would be a 3-gram, and so on. They have the data for up to 5-grams, and they also have information on parts of speech. To a linguist this is a great set of information, but I only need the simplest stuff, the single-word occurrences, or, if you prefer to sound like a linguist, '1-grams'.
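To give a rough idea of the format (the column layout below is my recollection and an illustration, not an exact specification; some releases append further counts such as the number of volumes a word appears in), a line in a 1-gram file can be split on tabs like this:

```python
# A made-up sample 1-gram line: word, year, occurrence count (tab-separated).
# Some releases append extra columns, e.g. the number of distinct volumes.
line = "analysis\t1995\t123456\t7890"

word, year, occurrences = line.split("\t")[:3]
print(word, int(year), int(occurrences))  # analysis 1995 123456
```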
I downloaded all of the data on the frequency of single words in the English corpus. There are several gigabytes of data to download and unzip. I then wrote a Python program to read and process the data. With over 6 million lines in the 'q' file alone, there is some processing to do. The program makes a list of the 10,000 most common words in the files you give it to analyze. I set it up to ignore words shorter than three letters, and to only count books published since 1980, so as to avoid archaic words like 'wherefore'. I also excluded words containing punctuation or other non-letter characters, since I wouldn't want to type those in a passphrase. With those parameters defined, the program crawls through each input file, adding up the number of occurrences of each word in each year I care about. After processing all the files, the word list is sorted and the top 10,000 words are written, in order, to a file.
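In sketch form, the counting step looks roughly like this (the file handling, thresholds, and output name here are illustrative stand-ins, not the actual program):

```python
import gzip
import re
import sys
from collections import Counter

MIN_YEAR = 1980     # only count books published since 1980
MIN_LENGTH = 3      # ignore words shorter than three letters
TOP_N = 10000       # size of the final word list
WORD_RE = re.compile(r"^[a-z]+$")  # letters only: no punctuation or digits

counts = Counter()

# Each argument is a gzipped 1-gram file downloaded from Google.
for path in sys.argv[1:]:
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            word = fields[0].lower()
            year = int(fields[1])
            occurrences = int(fields[2])
            if year < MIN_YEAR or len(word) < MIN_LENGTH or not WORD_RE.match(word):
                continue
            counts[word] += occurrences

# Write the most common words, one per line, most frequent first.
with open("wordlist.txt", "w", encoding="utf-8") as out:
    for word, _ in counts.most_common(TOP_N):
        out.write(word + "\n")
```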
That is the final word list that the passphrase program uses. The last few words in the list are verify, teenagers, and duct. I think these are definitely common enough to appear in passphrases, so the list is not too long. It is fun to browse the list; some words are surprisingly more or less common than you would have thought. Feel free to use the list for other purposes, or generate your own list with your own parameters using my program to process the Ngram files.
You can discuss this on Hacker News.