 

How to find out whether a word exists in English using NLTK

I am looking for a proper solution to this question. It has been asked many times before, and I didn't find a single answer that suited my needs. I need to use a corpus in NLTK to detect whether a word is an English word.

I have tried:

from nltk.corpus import wordnet
wordnet.synsets(word)

This doesn't work for many common words. Using a list of English words and performing a lookup in a file is not an option. Using enchant is not an option either. If there is another library that can do this, please show how to use its API. If not, please point me to a corpus in NLTK that contains all the words in English.

asked Mar 17 '15 by akshitBhatia


2 Answers

NLTK includes some corpora that are nothing more than word lists. The Words Corpus is the /usr/share/dict/words file from Unix, used by some spell checkers. We can use it to find unusual or misspelled words in a text corpus, as in the following function:

import nltk

def unusual_words(text):
    # Words from the input text: lowercased, alphabetic tokens only
    text_vocab = set(w.lower() for w in text.split() if w.isalpha())
    # The Words Corpus as a lowercased lookup set
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    # Anything in the text but not in the word list
    unusual = text_vocab - english_vocab
    return sorted(unusual)

In this case you can check the membership of your word directly against english_vocab:

>>> import nltk
>>> english_vocab = set(w.lower() for w in nltk.corpus.words.words())
>>> 'a' in english_vocab
True
>>> 'this' in english_vocab
True
>>> 'nothing' in english_vocab
True
>>> 'nothingg' in english_vocab
False
>>> 'corpus' in english_vocab
True
>>> 'Terminology'.lower() in english_vocab
True
>>> 'sorted' in english_vocab
True
answered Sep 28 '22 by Mazdak


I tried the above approach, but it failed for many words that should exist, so I tried WordNet instead. I think it has a more comprehensive vocabulary:

from nltk.corpus import wordnet

if wordnet.synsets(word):
    # Do something
else:
    # Do some other thing

answered Sep 28 '22 by Saurabh Malviya