Python NLP: Google ngram API

Question

I'm working on a Python NLP task in which I need to prune non-technical or very common noun phrases from a noisy list of noun phrases. Here is an example:

["people", "US presidents", "New York City", "electric cars", "vegan food", "the best"]

I need to prune out "people" and "the best". I want to do this using an ngram dataset: the frequency of "people" and "the best" is much higher than that of any other noun phrase in the list, so they can be labelled as outliers and removed. The Google ngram dataset is well suited for this purpose:

import requests

url = "https://books.google.com/ngrams/json"

query_params = {
    # Placeholder: replace with the noun phrase (or string of noun phrases) to look up
    "content": "<my_noun_phrase/string of noun phrases>",
    "year_start": 2017,
    "year_end": 2019,
    "corpus": 26,
    "smoothing": 1,
    "case_insensitive": True,
}
response = requests.get(url=url, params=query_params)

Unfortunately, the API (which is undocumented) does not tolerate much traffic: I often get 429 errors (too many requests). Is there a better way to interact with the Google ngram API? Or does anyone know of other APIs or web services that provide the same functionality (i.e. let users retrieve term frequency data for multi-word expressions from a very large corpus)? Thanks in advance!
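
For context, a common general mitigation for 429 responses is to retry with exponential backoff. The following is only a minimal sketch: the helper name `call_with_backoff`, its parameters, and the assumption that the server may send a numeric `Retry-After` header are illustrative and not taken from any documentation of the ngram endpoint.

```python
import time


def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a response whose status is not 429.

    `send` is any zero-argument callable returning an object with
    `status_code` and `headers` attributes (e.g. a `requests.Response`).
    Hypothetical helper, not part of any library.
    """
    for attempt in range(max_retries):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_retries - 1:
            # Use the Retry-After hint when it is a plain number of seconds;
            # otherwise wait base_delay, 2 * base_delay, 4 * base_delay, ...
            hint = str(response.headers.get("Retry-After", ""))
            sleep(float(hint) if hint.isdigit() else base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate-limited after {max_retries} attempts")
```

With the request shown earlier, this could be called as `call_with_backoff(lambda: requests.get(url=url, params=query_params))`. Backoff only spaces requests out; it does not raise whatever limit the server enforces.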

Martin Trenkmann · Accepted Answer

There is also NGRAMS, which lets you search version 3 of this dataset and also has a REST API.
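
Whichever service supplies the frequencies, the pruning step described in the question is small. Below is a minimal sketch, assuming you already have one relative frequency per phrase. The function name `prune_common`, the median-based cutoff, and the default `ratio` are illustrative assumptions, and the numbers in the example are made up, not real corpus frequencies.

```python
from statistics import median


def prune_common(freqs, ratio=10.0):
    """Keep phrases whose frequency is at most `ratio` times the median.

    `freqs` maps each phrase to its relative frequency. Phrases far above
    the median are treated as "very common" and dropped.
    """
    cutoff = ratio * median(freqs.values())
    return [phrase for phrase, freq in freqs.items() if freq <= cutoff]


# Made-up placeholder values for illustration only (NOT real ngram data)
example_freqs = {
    "people": 5e-4,
    "US presidents": 2e-7,
    "New York City": 2e-6,
    "electric cars": 4e-7,
    "vegan food": 1e-7,
    "the best": 1e-4,
}

kept = prune_common(example_freqs)
print(kept)
```

The median is used rather than the mean because a few very frequent phrases would pull the mean upward and hide themselves. The `ratio` threshold would need tuning against real frequencies.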

Tags: python, nlp, n-gram

mr_faulty
