
Feature extraction from a single word

Usually one extracts features from text using the bag-of-words approach: counting the words and computing different measures, for example tf-idf values, as in: How to include words as numerical feature in classification
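For reference, here is a minimal sketch of that standard approach (assuming scikit-learn is available; the sample documents are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical toy corpus: one document per string
    docs = [
        "potatoes are cut and fried to make french fries",
        "cream and butter are made from milk",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)  # documents x vocabulary matrix of tf-idf weights

    print(vectorizer.get_feature_names_out())
    print(X.toarray())

This gives a feature vector per document, not per word, which is why it does not fit my problem below.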

But my problem is different: I want to extract a feature vector from a single word. I want to know, for example, that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream are also close, as are hot and warm, stone and hard, and so on.

What is this problem called? Can I learn the similarities and features of words just by looking at a large number of documents?

The implementation will not be for English, so I can't rely on existing (English) lexical databases.

asked Feb 11 '13 by user1506145


2 Answers

Hmm, feature extraction (e.g. tf-idf) on text data is based on statistics. You, on the other hand, are looking for meaning (semantics). Therefore a method like tf-idf will not work for you on its own.

In NLP there are 3 basic levels:

  1. morphological analysis
  2. syntactic analysis
  3. semantic analysis

(a higher number represents a bigger problem :)). Morphology is well understood for the majority of languages. Syntactic analysis is a bigger problem (it deals with things like identifying which word is a verb or a noun in a sentence, ...). Semantic analysis is the most challenging, since it deals with meaning, which is quite difficult to represent in machines, has many exceptions, and is language-specific.

As far as I understand, you want to know some relationships between words. This can be done via so-called dependency treebanks (or just treebanks): http://en.wikipedia.org/wiki/Treebank . A treebank is a database/graph of sentences where a word can be considered a node and a relationship an arc. There is a good treebank for Czech, and there are also some for English, but for many 'less-covered' languages it can be a problem to find one ...
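To make the node/arc idea concrete, here is a minimal sketch (not a real treebank format; the relation labels are just illustrative) of one dependency-annotated sentence:

    # One sentence, represented as dependency arcs: (head word, relation, dependent word)
    sentence = "french fries are made of potatoes"

    arcs = [
        ("fries",    "amod",  "french"),    # "french" modifies "fries"
        ("made",     "nsubj", "fries"),     # "fries" is the subject of "made"
        ("made",     "aux",   "are"),
        ("made",     "obl",   "potatoes"),  # "potatoes" attaches to "made"
        ("potatoes", "case",  "of"),
    ]

Words that keep appearing with the same heads and relations across many sentences can then be treated as related.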

answered Sep 28 '22 by xhudik


user1506145,

Here is a simple idea that I have used in the past. Collect a large number of short documents like Wikipedia articles. Do a word count on each document. For the ith document and the jth word let

I = the number of documents,

J = the number of words,

x_ij = the number of times the jth word appears in the ith document, and

y_ij = ln(1 + x_ij).

Let [U, D, V] = svd(Y) be the singular value decomposition of Y, so that Y = U*D*transpose(V), where U is IxI, D is a diagonal IxJ matrix, and V is JxJ.

You can use (V_j1, V_j2, V_j3, V_j4), i.e. the first four entries of the jth row of V, as a feature vector in R^4 for the jth word.
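Here is a minimal sketch of this recipe in Python/NumPy (the toy corpus and whitespace tokenization are made up; for a large corpus you would use a sparse, truncated SVD instead of the full one):

    import numpy as np

    # Hypothetical toy corpus; in practice use many short documents (e.g. Wikipedia articles)
    docs = [
        "french fries are made of potatoes",
        "potatoes are boiled or fried",
        "cream is made from milk",
        "milk and cream go into coffee",
    ]

    # Build the vocabulary and the I x J count matrix X (x_ij = count of word j in document i)
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.split():
            X[i, index[w]] += 1

    # Dampen raw counts: y_ij = ln(1 + x_ij)
    Y = np.log1p(X)

    # SVD: Y = U @ diag(D) @ Vt, so the rows of V = Vt.T index words
    U, D, Vt = np.linalg.svd(Y, full_matrices=False)
    V = Vt.T

    # Feature vector for the j-th word: first four entries of the j-th row of V
    k = min(4, V.shape[1])
    features = {w: V[index[w], :k] for w in vocab}

    print(features["potatoes"])
    print(features["fries"])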

answered Sep 28 '22 by Hans Scundal