I have a word, and I want to find out whether a given text is related to that word, using Python and NLTK. Is that possible?
For example, take the word "phosphorous". I would like to find out whether a particular text file is related to this word or not.
I can't use a bag-of-words model in NLTK, as I have only one word and no training data.
Any suggestions?
Thanks in advance.
Not without a corpus, no.
Look at it this way: can you, an intelligent being, tell whether 光 is related to 部屋に入った時電気をつけました without asking someone or something that actually knows Japanese (assuming you don't know Japanese; if you do, try with "svjetlo" and "Kad je ušao u sobu, upalio je lampu")? If you can't, how do you expect a computer to do it?
And another experiment: can you, an intelligent being, give me the algorithm by which you could teach a non-English-speaking person that "light" is related to "When he entered the room, he turned on the lamp"? Again, no.
tl;dr: You need training data, unless you significantly restrict the meaning of "related" (to "contains", for example).
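For example, here is a minimal sketch of that restricted version, where "related" just means "the document contains the word or an inflected form of it". The file name document.txt and the contains_word helper are made up for illustration, and the standard NLTK data packages (punkt, wordnet) need to be downloaded first:

import nltk
from nltk.stem import WordNetLemmatizer

def contains_word(path, target):
    """Return True if the document contains `target` (after lemmatization)."""
    lemmatizer = WordNetLemmatizer()
    target = lemmatizer.lemmatize(target.lower())
    with open(path, encoding='utf-8') as f:
        tokens = nltk.word_tokenize(f.read().lower())
    # Lemmatize tokens so that, e.g., "lamps" matches "lamp".
    return any(lemmatizer.lemmatize(tok) == target for tok in tokens)

print(contains_word('document.txt', 'phosphorous'))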
You can use NLTK's WordNet interface to calculate a path-similarity score between your word and the words in the text, and build a heuristic based on that score:
from nltk.corpus import wordnet as wn

# Look up one sense (synset) of each word, then compare them.
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')

# Path similarity ranges from 0 to 1; higher means more closely related.
wn.path_similarity(hit, slap)
You can find more NLTK WordNet usage examples here: http://www.nltk.org/howto/wordnet.html
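To tie this back to the original question, here is a rough sketch of one way such a heuristic could look. The relatedness function, the document.txt file name, and the 0.5 cutoff are my own placeholders, not anything built into NLTK; it simply takes the best path-similarity match between any token in the file and any sense of the target word:

import nltk
from nltk.corpus import wordnet as wn

def relatedness(path, target_word):
    """Best WordNet path similarity between any token in the file and the target word."""
    target_synsets = wn.synsets(target_word)
    with open(path, encoding='utf-8') as f:
        tokens = set(nltk.word_tokenize(f.read().lower()))
    best = 0.0
    for tok in tokens:
        for s1 in wn.synsets(tok):
            for s2 in target_synsets:
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
    return best  # 1.0 means an exact or synonymous match was found

# Path similarity is defined over the noun/verb hierarchies, so the noun
# "phosphorus" tends to work better here than the adjective "phosphorous".
if relatedness('document.txt', 'phosphorus') > 0.5:
    print('The text looks related to the target word.')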