I'm using Python to parse URLs into words. I am having some success, but I am trying to cut down on ambiguity. For example, I am given the following URL
"abbeycarsuk.com"
and my algorithm outputs:
['abbey','car','suk'],['abbey','cars','uk']
Clearly the second parsing is the correct one, but the first one is also technically just as correct (apparently 'suk' is a word in the dictionary that I am using).
What would help me out a lot is a wordlist that also contains the frequency/popularity of each word. I could work this into my algorithm, and then the second parsing would be chosen (since 'uk' is obviously more common than 'suk'). Does anyone know where I could find such a list? I found wordfrequency.info, but they charge for the data, and the free sample they offer does not contain enough words for me to use it successfully.
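To illustrate, this is roughly how I imagine using such a list; the counts below are made-up placeholders, not real frequency data:

```python
import math

# Placeholder counts; a real list would supply these.
word_freq = {'abbey': 12000, 'car': 350000, 'cars': 90000, 'uk': 500000, 'suk': 40}
total = sum(word_freq.values())

def score(words):
    # Sum of log probabilities; unseen words get a tiny floor so log() is defined.
    return sum(math.log(word_freq.get(w, 0.5) / total) for w in words)

candidates = [['abbey', 'car', 'suk'], ['abbey', 'cars', 'uk']]
print(max(candidates, key=score))  # ['abbey', 'cars', 'uk'] with these counts
```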
Alternatively, I suppose I could download a large corpus (Project Gutenberg?) and compute the frequency values myself; however, if such a data set already exists, it would make my life a lot easier.
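Something like this is what I had in mind for the do-it-myself route; 'corpus' is just a placeholder directory of plain-text files:

```python
import re
from collections import Counter
from pathlib import Path

# Count lowercase word occurrences across all .txt files in the directory.
counts = Counter()
for path in Path('corpus').glob('*.txt'):
    text = path.read_text(encoding='utf-8', errors='ignore').lower()
    counts.update(re.findall(r'[a-z]+', text))

print(counts.most_common(10))
```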
There is an extensive article on this very subject written by Peter Norvig (Google's Director of Research), which contains worked examples in Python and is fairly easy to understand. The article, along with the data used in the sample programs (some excerpts of Google ngram data), can be found here. The complete set of Google ngrams, for several languages, can be found here (free to download if you live in the east of the US).
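For a flavour of the approach, here is a heavily condensed sketch in the spirit of the segmenter from that article (not his exact code); the toy counts stand in for the ngram data:

```python
import math
from functools import lru_cache

# Placeholder unigram counts; the article's data provides real ones.
counts = {'abbey': 12000, 'car': 350000, 'cars': 90000, 'uk': 500000, 'suk': 40}
total = sum(counts.values())

def Pw(word):
    # Smoothed unigram probability; unknown words are penalised by their length.
    return counts.get(word, 0.1 / 10 ** len(word)) / total

@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable tuple of words that concatenates to `text`."""
    if not text:
        return ()
    splits = [(text[:i], text[i:]) for i in range(1, min(len(text), 20) + 1)]
    candidates = [(first,) + segment(rest) for first, rest in splits]
    return max(candidates, key=lambda words: sum(math.log(Pw(w)) for w in words))

print(segment('abbeycarsuk'))  # -> ('abbey', 'cars', 'uk') with these counts
```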