
Looking for a database or text file of English words with their different forms

I am working on a project and need to find the root of a given word (stemming). As you know, stemming algorithms that don't use a dictionary are not accurate. I also tried WordNet, but it is not a good fit for my project. I found the phpMorphy project, but it doesn't provide a Java API.

At this point I am looking for a database or a text file of English words with their different forms, for example:

run running ran ...
include including included ...
...

Thank you for your help or advice.

Majid Darabi asked Aug 21 '13 19:08



1 Answer

You could download LanguageTool (Disclaimer: I'm the maintainer), which comes with a binary file english.dict. The LanguageTool Wiki describes how to dump that file as a text file:

java -jar morfologik-tools-1.6.0-standalone.jar fsa_dump -x -d english.dict

For run, the file will contain this:

ran run VBD
run run NN
run run VB
run run VBN
run run VBP
running run VBG
runs run NNS
runs run VBZ

The first column is the inflected form, the second is the base form, and the third is the part-of-speech tag according to the (slightly extended) Penn Treebank tagset.
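Once the file is dumped, a lookup in Java could be built roughly like this. This is only a minimal sketch, not part of LanguageTool: the class name LemmaLookup and its two methods are hypothetical, and it assumes the three columns are separated by whitespace (check the separator in your actual dump). It builds one map from each inflected form to its base forms, and a second map from each base form to all of its forms.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a LanguageTool or morfologik API.
class LemmaLookup {

    // Maps each inflected form (column 1) to its base forms (column 2).
    public static Map<String, Set<String>> baseForms(List<String> lines) {
        Map<String, Set<String>> index = new LinkedHashMap<>();
        for (String line : lines) {
            // Assumption: columns are whitespace-separated.
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 3) {
                continue; // skip blank or malformed lines
            }
            index.computeIfAbsent(cols[0], k -> new LinkedHashSet<>()).add(cols[1]);
        }
        return index;
    }

    // Maps each base form (column 2) to all of its inflected forms (column 1).
    public static Map<String, Set<String>> inflectedForms(List<String> lines) {
        Map<String, Set<String>> index = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 3) {
                continue;
            }
            index.computeIfAbsent(cols[1], k -> new LinkedHashSet<>()).add(cols[0]);
        }
        return index;
    }

    public static void main(String[] args) {
        // The sample lines for "run" shown above; a real program would
        // read the dumped text file instead.
        List<String> sample = Arrays.asList(
            "ran run VBD", "run run NN", "run run VB", "run run VBN",
            "run run VBP", "running run VBG", "runs run NNS", "runs run VBZ");

        System.out.println(baseForms(sample).get("ran"));      // prints [run]
        System.out.println(inflectedForms(sample).get("run")); // prints [ran, run, running, runs]
    }
}
```

The sets drop the duplicates that come from one form carrying several part-of-speech tags. If you need the tag as well (for example to tell the noun "runs" from the verb "runs"), keep the third column instead of discarding it.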

Daniel Naber answered Sep 24 '22 07:09