 

Availability of a list with English words (including frequencies)? [closed]

I'm using Python to parse URLs into words. I am having some success, but I am trying to cut down on ambiguity. For example, I am given the following URL:

"abbeycarsuk.com"

and my algorithm outputs:

['abbey','car','suk'],['abbey','cars','uk']

Clearly the second parsing is the correct one, but the first one is also technically just as correct (apparently 'suk' is a word in the dictionary that I am using).

What would help me out a lot is a word list that also contains the frequency/popularity of each word. I could work this into my algorithm, and then the second parsing would be chosen (since 'uk' is obviously more common than 'suk'). Does anyone know where I could find such a list? I found wordfrequency.info, but they charge for the data, and the free sample they offer does not have enough words for me to use it successfully.
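
Roughly what I have in mind (the frequency numbers below are made up just for illustration):

    import math

    # hypothetical word -> frequency mapping; real numbers would come from a word list
    freq = {'abbey': 12000, 'car': 180000, 'cars': 90000, 'uk': 250000, 'suk': 40}

    def parse_score(words, missing=0.5):
        # sum of log frequencies; unknown words get a small pseudo-count
        # so one rare word does not dominate the comparison
        return sum(math.log(freq.get(w, missing)) for w in words)

    candidates = [['abbey', 'car', 'suk'], ['abbey', 'cars', 'uk']]
    print(max(candidates, key=parse_score))  # ['abbey', 'cars', 'uk']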

Alternatively, I suppose I could download a large corpus (Project Gutenberg?) and compute the frequency values myself; however, if such a data set already exists, it would make my life a lot easier.
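
If I go that route, I imagine something as simple as this would do for building the table ('corpus.txt' is just a placeholder for whatever corpus file I end up with):

    import re
    from collections import Counter

    def build_freq(path):
        # lower-case the text and count alphabetic tokens
        with open(path, encoding='utf-8') as f:
            return Counter(re.findall(r'[a-z]+', f.read().lower()))

    # freq = build_freq('corpus.txt')
    # print(freq.most_common(10))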

asked Dec 15 '22 by user1893354


1 Answer

There is an extensive article on this very subject written by Peter Norvig (Google's head of research), which contains worked examples in Python and is fairly easy to understand. The article, along with the data used in its sample programs (some excerpts of Google ngram data), can be found here. The complete set of Google ngrams, for several languages, can be found here (free to download if you live in the east of the US).
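
To give a flavour of the approach the article develops, here is a simplified sketch (not Norvig's code); the unigram_counts below are placeholder numbers, and in practice you would load the ngram counts that come with the article:

    import math
    from functools import lru_cache

    # placeholder counts; replace with a real unigram frequency table
    unigram_counts = {'abbey': 12000, 'car': 180000, 'cars': 90000,
                      'uk': 250000, 'suk': 40}
    TOTAL = sum(unigram_counts.values())

    def log_prob(word):
        # unknown words get a penalty that grows with their length
        count = unigram_counts.get(word)
        if count is None:
            return math.log(10.0 / (TOTAL * 10 ** len(word)))
        return math.log(count / TOTAL)

    @lru_cache(maxsize=None)
    def segment(text):
        # return (score, words) for the best split of `text`
        if not text:
            return 0.0, []
        best = None
        for i in range(1, min(len(text), 20) + 1):
            head, tail = text[:i], text[i:]
            tail_score, tail_words = segment(tail)
            score = log_prob(head) + tail_score
            if best is None or score > best[0]:
                best = (score, [head] + tail_words)
        return best

    print(segment('abbeycarsuk')[1])  # ['abbey', 'cars', 'uk']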

answered Feb 01 '23 by michaelmeyer