I wish to use Google 2-grams for my project, but the size of the data makes searching expensive in terms of both speed and storage.
Is there a web API available for this purpose (in any language)? The website http://books.google.com/ngrams/graph renders an image; can I get the underlying data values?
Although Google Ngram Viewer claims that the results are reliable from 1800 onwards, poor OCR and insufficient data mean that frequencies given for languages such as Chinese may only be accurate from 1970 onward, with earlier parts of the corpus showing no results at all for common terms, and data for some years ...
Google Ngram is a search engine that charts word frequencies from a large corpus of books that were printed between 1500 and 2008. The tool generates charts by dividing the number of a word's yearly appearances by the total number of words in the corpus in that year.
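The division described above can be sketched in a few lines of Python. The counts below are made-up illustrative numbers, not real corpus data, and `relative_frequency` is a hypothetical helper name, not part of any Google tool.

```python
def relative_frequency(word_count: int, total_words: int) -> float:
    """Share of all words in one year's corpus that are the target word."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return word_count / total_words


# Hypothetical yearly counts: year -> (occurrences of the word, total words that year)
yearly_counts = {
    1900: (42_000, 1_000_000),
    2000: (126_000, 3_000_000),
}

# One relative frequency per year, which is the quantity a chart would plot
frequencies = {
    year: relative_frequency(count, total)
    for year, (count, total) in yearly_counts.items()
}

for year, freq in sorted(frequencies.items()):
    print(f"{year}: {freq:.2%}")
```

Note that the raw count triples between the two hypothetical years while the relative frequency stays the same, which is why the chart normalizes by the yearly total.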
The percentages on the Y-axis in Google Ngram represent the percent of keywords in Google's sample of books, written in English and published in the United States, that are the target keyword. For example, searching Google Ngram for "the" shows that "the" makes up 4.2% of modern published text.
N-gram tokenizer. The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length. N-grams are like a sliding window that moves across the word - a continuous sequence of characters of the specified length ...
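To make the sliding-window idea concrete, here is a minimal plain-Python sketch. It only illustrates the description above and is not the tokenizer's actual implementation; the function names and the `split_chars` parameter are assumptions made for this example.

```python
def split_words(text: str, split_chars: str) -> list[str]:
    """Break text into words at any character contained in split_chars."""
    words, current = [], []
    for ch in text:
        if ch in split_chars:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def char_ngrams(text: str, n: int, split_chars: str = " ") -> list[str]:
    """Emit every window of n consecutive characters inside each word."""
    if n < 1:
        raise ValueError("n must be at least 1")
    grams = []
    for word in split_words(text, split_chars):
        # Slide a window of width n across the word, one character at a time
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


print(char_ngrams("red panda", 3))  # ['red', 'pan', 'and', 'nda']
```

Words shorter than `n` produce no output in this sketch, because no full window fits inside them.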
Well, I found a roundabout way of doing that, using Google BigQuery.
There, trigrams are publicly available. Querying them through the command-line access did the job for me.
I found a great alternative: Microsoft Web N-Gram
It can be queried in different ways, including a straightforward GET call through the REST interface. For instance, calling the URL:
http://weblm.research.microsoft.com/weblm/rest.svc/bing-body/apr10/1/jp?u={YOUR_TOKEN}&p=red+panda
returns
-9.005
which is the log likelihood of the phrase red panda.
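As a rough sketch, the same URL can be assembled in Python and the returned value turned back into a probability. The function names are hypothetical, the path segments are copied from the example URL rather than from any documentation, and the base-10 logarithm is an assumption: the text above only says "log likelihood". No request is sent here.

```python
from urllib.parse import urlencode

# Base address taken from the example URL above
BASE_URL = "http://weblm.research.microsoft.com/weblm/rest.svc"


def build_query_url(phrase: str, token: str,
                    model: str = "bing-body/apr10/1",
                    operation: str = "jp") -> str:
    """Assemble a GET URL shaped like the example; urlencode turns spaces into '+'."""
    query = urlencode({"u": token, "p": phrase})
    return f"{BASE_URL}/{model}/{operation}?{query}"


def log10_to_probability(log_likelihood: float) -> float:
    """Convert a log likelihood to a probability, ASSUMING the log is base 10."""
    return 10 ** log_likelihood


url = build_query_url("red panda", "YOUR_TOKEN")
print(url)

# The service answers with plain text such as "-9.005"
response_text = "-9.005"
print(log10_to_probability(float(response_text)))
```

If the service actually used natural logarithms, `math.exp` would be the right conversion instead.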
Furthermore, it is handier than Google N-Grams, as for a given phrase it does not simply output its absolute frequency, but it can output its joint probability, conditional probability and even the most likely words that follow.
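The joint and conditional probabilities mentioned here are linked by the chain rule: log P(w2 | w1) = log P(w1 w2) - log P(w1). The sketch below shows that relation only. The value for "red" is a made-up placeholder, not something returned by the service, and `conditional_log_prob` is a hypothetical helper name.

```python
def conditional_log_prob(joint_log_prob: float, context_log_prob: float) -> float:
    """Chain rule in log space: log P(w2 | w1) = log P(w1 w2) - log P(w1)."""
    return joint_log_prob - context_log_prob


joint_red_panda = -9.005  # log likelihood of "red panda" from the example above
log_prob_red = -4.0       # hypothetical placeholder for the log likelihood of "red"

# Log probability of "panda" given that the previous word is "red"
print(conditional_log_prob(joint_red_panda, log_prob_red))
```

The subtraction works for any logarithm base, as long as both inputs use the same one.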
Disclaimer: I am not a Microsoft employee, I simply think that I just found an awesome service.