
How to represent text documents as feature vectors for text classification?

I have around 10,000 text documents.

How to represent them as feature vectors, so that I can use them for text classification?

Is there any tool which does the feature vector representation automatically?

asked Feb 14 '12 by tina


People also ask

How do you represent a feature vector?

An example of a feature vector you might be familiar with is an RGB (red-green-blue) color description. A color can be described by how much red, green, and blue it contains, so a feature vector for it would be color = [R, G, B].

What is a feature vector in NLP?

Word2Vec is widely used in NLP models. It transforms words into vectors. Word2vec is a shallow two-layer neural network that processes text: its input is a text corpus and its output is a set of feature vectors that represent the words in that corpus.
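
A minimal sketch using the gensim library (an assumption; no particular implementation is named above), assuming gensim 4.x and a tiny made-up corpus:

    from gensim.models import Word2Vec

    # each sentence is a list of tokens; a real corpus would be far larger
    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["dogs", "and", "cats", "are", "pets"]]

    # vector_size is the dimensionality of the learned feature vectors
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

    print(model.wv["cat"])   # the feature vector learned for "cat"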


2 Answers

The easiest approach is to go with the bag of words model. You represent each document as an unordered collection of words.

You probably want to strip out punctuation and you may want to ignore case. You might also want to remove common words like 'and', 'or' and 'the'.

To adapt this into a feature vector you could choose (say) 10,000 representative words from your sample, and have a binary vector v[i,j] = 1 if document i contains word j and v[i,j] = 0 otherwise.
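
A minimal sketch of this binary bag-of-words representation in plain Python (the tokenizer, the 10,000-word cap, and the example documents are illustrative assumptions, not part of the answer above):

    import re
    from collections import Counter

    def tokenize(text):
        # lowercase and keep only word-like tokens, stripping punctuation
        return re.findall(r"[a-z0-9']+", text.lower())

    def build_vocabulary(documents, max_words=10000):
        # choose the most frequent words as the representative vocabulary
        counts = Counter(word for doc in documents for word in tokenize(doc))
        return [word for word, _ in counts.most_common(max_words)]

    def binary_vector(document, vocabulary):
        # v[j] = 1 if the document contains word j, 0 otherwise
        words = set(tokenize(document))
        return [1 if word in words else 0 for word in vocabulary]

    docs = ["The cat sat on the mat.", "Dogs and cats are pets."]
    vocab = build_vocabulary(docs)
    vectors = [binary_vector(d, vocab) for d in docs]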

answered Oct 24 '22 by Chris Taylor


To give a really good answer to the question, it would help to know what kind of classification you are interested in: by genre, author, sentiment, etc. For stylistic classification, for example, the function words are important; for a classification based on content they are just noise, and they are usually filtered out using a stop word list.

If you are interested in a classification based on content, you may want to use a weighting scheme like term frequency / inverse document frequency (tf-idf), in order to give more weight to words which are typical for a document and comparatively rare in the whole text collection. This assumes a vector space model of your texts, which is a bag-of-words representation of the text (see Wikipedia on Vector Space Model and tf-idf). Usually tf-idf will yield better results than a binary weighting scheme which only records whether a term occurs in a document.
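
As a rough sketch of how such a weighting could be computed (one common variant of tf-idf; the exact formula and smoothing differ between implementations, and the example documents are made up):

    import math

    def tf_idf(term, document, documents):
        # term frequency: how often the term occurs in this document
        tf = document.count(term)
        # document frequency: in how many documents the term occurs at all
        df = sum(1 for d in documents if term in d)
        # rare terms get a higher idf and therefore more weight
        idf = math.log(len(documents) / df) if df else 0.0
        return tf * idf

    # documents represented as lists of tokens
    docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
    print(tf_idf("cat", docs[0], docs))   # ~0.41: frequent here, rarer overall
    print(tf_idf("the", docs[0], docs))   # 0.0: occurs in every document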

This approach is so established and common that machine learning libraries like Python's scikit-learn offer convenience methods which convert a text collection into a matrix using tf-idf as the weighting scheme.
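
For example, with scikit-learn's TfidfVectorizer (a sketch assuming scikit-learn 1.x; the stop word setting and the toy documents are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["The cat sat on the mat.", "Dogs and cats are pets."]

    # builds the vocabulary, lowercases, strips punctuation, drops English stop
    # words, and returns a sparse document-term matrix with tf-idf weights
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    print(X.shape)                            # (n_documents, vocabulary_size)
    print(vectorizer.get_feature_names_out()) # the words behind each column

Each row of X is the feature vector of one document and can be fed directly to a classifier from the same library.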


answered Oct 24 '22 by fotis j