Google N-Gram Web API [closed]

I wish to use Google 2-grams for my project, but the size of the data makes searching expensive in both speed and storage.
Is there a web API available for this purpose (in any language)? The website http://books.google.com/ngrams/graph renders an image; can I get the underlying data values instead?

asked Jun 29 '12 11:06 by Five



2 Answers

Well, I found a roundabout way of doing that, using Google BigQuery.
Trigrams are available there as public data, and command-line access did the job for me.
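
For anyone wanting to try the same route, here is a minimal sketch of how such a lookup could be assembled. It only builds the SQL and the `bq` command line; it does not run anything. The table name `bigquery-public-data.samples.trigrams` and the column names `first`, `second` and `ngram` are my assumptions about the public sample, and the helper names are hypothetical, so check them against the actual schema before relying on this.

```python
import shlex

# Assumed table: BigQuery's public trigram sample. The column names
# `first`, `second` and `ngram` are assumptions about its schema.
TRIGRAM_TABLE = "bigquery-public-data.samples.trigrams"


def build_trigram_query(first_word, second_word, limit=10):
    """Return SQL listing trigrams that start with the given two words."""
    for word in (first_word, second_word):
        # Keep the sketch safe: only plain alphabetic words are accepted,
        # so nothing can break out of the quoted SQL literal.
        if not word.isalpha():
            raise ValueError(f"unsupported word: {word!r}")
    return (
        f"SELECT ngram FROM `{TRIGRAM_TABLE}` "
        f"WHERE `first` = '{first_word}' AND `second` = '{second_word}' "
        f"LIMIT {int(limit)}"
    )


def build_bq_command(sql):
    """Return the argument list for the `bq` command-line tool."""
    return ["bq", "query", "--use_legacy_sql=false", sql]


if __name__ == "__main__":
    command = build_bq_command(build_trigram_query("red", "panda"))
    # Print a copy-pasteable shell command instead of executing it.
    print(" ".join(shlex.quote(part) for part in command))
```

Running the script prints a shell command that can be pasted into a terminal where the `bq` tool is installed and authenticated.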

answered Oct 04 '22 01:10 by Five


I found a great alternative: Microsoft Web N-Gram

It can be queried in different ways, including a straightforward GET call through the REST interface. For instance, calling the URL:

http://weblm.research.microsoft.com/weblm/rest.svc/bing-body/apr10/1/jp?u={YOUR_TOKEN}&p=red+panda

returns

-9.005

which is the log likelihood of the phrase red panda.
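
As a rough sketch, the same call can be made from Python with only the standard library. The URL pieces are copied from the example above; the function names are hypothetical, and treating the returned number as a base-10 logarithm is an assumption on my part, since the base is not stated here. It also assumes the service is still reachable at that address.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the example above.
BASE_URL = "http://weblm.research.microsoft.com/weblm/rest.svc"


def build_url(token, phrase, model="bing-body/apr10/1", operation="jp"):
    """Build the GET URL for one phrase lookup."""
    # urlencode turns spaces into '+', matching 'red+panda' in the example.
    query = urlencode({"u": token, "p": phrase})
    return f"{BASE_URL}/{model}/{operation}?{query}"


def parse_log_prob(body):
    """Parse the plain-text response (e.g. '-9.005') into a float."""
    return float(body.strip())


def fetch_log_prob(token, phrase):
    """Perform the request and return the parsed log likelihood."""
    with urlopen(build_url(token, phrase)) as response:
        return parse_log_prob(response.read().decode("utf-8"))


if __name__ == "__main__":
    log_prob = parse_log_prob("-9.005")
    # Assumption: the value is log10(p), so p = 10 ** log_prob.
    print(log_prob, 10 ** log_prob)
```

Calling `fetch_log_prob("YOUR_TOKEN", "red panda")` would issue the request shown above; the `__main__` block only parses the sample response, so it runs without network access.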

Furthermore, it is handier than Google N-Grams: for a given phrase it does not simply output the absolute frequency, but can also return the joint probability, the conditional probability, and even the most likely words that follow.

Disclaimer: I am not a Microsoft employee; I simply think I have found an awesome service.

answered Oct 04 '22 00:10 by Alphaaa