I'd like to write an application that searches Google's Ngram data to return words and phrases that used to be more popular, by some arbitrary percentage, within some arbitrary range of years, than they are now.
For example: https://books.google.com/ngrams/graph?content=cowabunga&year_start=1950&year_end=2000&corpus=15&smoothing=3
Ideally, I'd like to be able to find these words and phrases without specifying them up front. Can anyone help me come up with a way to do this using a downloaded copy of the Ngrams data?
The first step after downloading some n-grams is to dump them into a SQLite3 database. For example, I fetched the 1-grams starting with the letter 't'.
To dump them into SQLite, run the command sqlite3 1grams.db and then enter the following at the prompt:
sqlite> create table t1grams (ngram text, year integer, match_count integer, volume_count integer);
sqlite> .separator "\t"
sqlite> .import googlebooks-eng-all-1gram-20120701-t t1grams
The second step is to pick the year range, call the endpoints YEAR_START and YEAR_END, and your percentage, call it PERCENT_THRESHOLD.
Your problem reduces to a query that selects those ngrams whose match_count is at least PERCENT_THRESHOLD% lower at YEAR_END than at YEAR_START.