
Python NLP: Google ngram API

Tags: python, nlp, n-gram

I'm working on a Python NLP task where I need to prune non-technical or very common noun phrases from a noisy list of noun phrases. Here is an example:

["people", "US presidents", "New York City", "electric cars", "vegan food", "the best"]

I need to prune out "people" and "the best". I want to do this using an ngram dataset: the frequency of 'people' and 'the best' is much higher than that of any other noun phrase, so it would be possible to label them as outliers and prune them out. The Google ngram dataset is well suited for this purpose:

import requests

url = "https://books.google.com/ngrams/json"

# A single noun phrase, or several phrases joined by commas
noun_phrases = ["people", "US presidents", "New York City"]

query_params = {
    "content": ",".join(noun_phrases),
    "year_start": 2017,
    "year_end": 2019,
    "corpus": 26,
    "smoothing": 1,
    "case_insensitive": True,
}
response = requests.get(url=url, params=query_params)
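Once a frequency is available for each phrase, the outlier idea described above could be sketched roughly as follows. This is only one possible approach, not something the ngram service provides: the function name `prune_common_phrases`, the median/MAD rule on log frequencies, and the cutoff of 3.5 are all assumptions chosen for illustration, and `freqs` is assumed to be a dict mapping each phrase to a frequency you have already retrieved.

```python
import math
import statistics


def prune_common_phrases(freqs, threshold=3.5):
    """Return the phrases whose frequency is NOT an upper outlier.

    freqs: dict mapping phrase -> relative frequency (hypothetical input,
    e.g. averaged over the years returned by an ngram service).
    threshold: cutoff on the modified z-score (an assumed default).
    """
    # Work in log space, since n-gram frequencies span orders of magnitude.
    logs = {p: math.log10(f) for p, f in freqs.items() if f > 0}
    if len(logs) < 3:
        return list(freqs)  # too few points to call anything an outlier

    med = statistics.median(logs.values())
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = statistics.median(abs(v - med) for v in logs.values())
    if mad == 0:
        return list(freqs)  # no measurable spread, keep everything

    kept = []
    for phrase in freqs:
        value = logs.get(phrase)
        if value is None:
            kept.append(phrase)  # zero frequency: certainly not "too common"
            continue
        score = 0.6745 * (value - med) / mad  # modified z-score
        if score <= threshold:  # only unusually HIGH frequencies are pruned
            kept.append(phrase)
    return kept
```

The rule is one-sided on purpose: rare phrases are kept, and only phrases far above the typical frequency are dropped. The threshold will likely need tuning against a hand-labelled sample of your own phrases.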

Unfortunately, their API (which is undocumented) can't handle much traffic: I often get HTTP 429 errors (Too Many Requests). Is there a better way to interact with the Google ngram API? Or does anyone know of other APIs or web services that provide the same functionality (i.e. let users retrieve term frequency data for multi-word expressions from a very large corpus)? Thanks in advance!
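For reference, a common way to cope with 429 responses is to retry with exponential backoff. The sketch below is a minimal illustration under stated assumptions: the helper name `get_with_backoff`, the retry count, and the delays are hypothetical, and whether this endpoint sends a `Retry-After` header is not known, so the header is only used if present.

```python
import time


def get_with_backoff(get, url, params, max_retries=5, base_delay=1.0,
                     sleep=time.sleep):
    """Call get(url, params=...) and retry on HTTP 429 with growing delays.

    get: a callable with the signature of requests.get (passed in so the
    helper does not depend on a particular HTTP client).
    max_retries, base_delay: assumed defaults, not values the API documents.
    """
    for attempt in range(max_retries + 1):
        response = get(url, params=params, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()  # surface other HTTP errors
            return response
        if attempt == max_retries:
            break  # out of retries, give up below
        # Use the server's hint when it is a plain number of seconds,
        # otherwise wait 1s, 2s, 4s, ... (base_delay * 2**attempt).
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```

With the query above, this would be called as `get_with_backoff(requests.get, url, query_params)`. Sending several comma-separated phrases per request and caching results locally would further reduce the number of calls.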

mr_faulty asked Jun 05 '26 23:06


1 Answer

There is also NGRAMS, which lets you search version 3 of the Google Books Ngram dataset. It also has a REST API.

Martin Trenkmann answered Jun 08 '26 14:06


