I have around 10,000 text documents.
How can I represent them as feature vectors so that I can use them for text classification?
Is there a tool that builds the feature-vector representation automatically?
An example of a feature vector you might be familiar with is an RGB (red-green-blue) color description. A color can be described by how much red, green, and blue it contains, so a feature vector for a color would be color = [R, G, B].
Word2Vec is widely used in NLP models. It transforms words into vectors. Word2vec is a two-layer neural network that processes text: its input is a text corpus and its output is a set of feature vectors, one per word, that represent the words in that corpus.
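If you want to try this, here is a minimal sketch using the gensim library (an assumption on my part, it isn't named above); the tiny pre-tokenized corpus and the word-vector averaging at the end are purely illustrative:

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical pre-tokenized corpus; yours would be the ~10,000 documents.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
]

# Train a small Word2Vec model; vector_size is the dimensionality of each word vector.
model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1)

# Look up the learned feature vector for a word (a 50-dimensional array).
print(model.wv["cat"])

# One crude way to get a document vector: average the vectors of its words.
doc_vec = np.mean([model.wv[w] for w in corpus[0]], axis=0)
```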
The easiest approach is to go with the bag of words model. You represent each document as an unordered collection of words.
You probably want to strip out punctuation and you may want to ignore case. You might also want to remove common words like 'and', 'or' and 'the'.
To adapt this into a feature vector you could choose (say) 10,000 representative words from your sample, and have a binary vector v[i,j] = 1 if document i contains word j, and v[i,j] = 0 otherwise.
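In practice you rarely build that matrix by hand. A sketch of one way to do it with scikit-learn's CountVectorizer (the two-document list is just a placeholder for your collection):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; in your case this would be the ~10,000 documents.
docs = [
    "The cat sat on the mat.",
    "Dogs and cats are pets.",
]

# binary=True gives the 0/1 scheme described above; the default tokenizer
# lowercases and strips punctuation, and stop_words='english' drops common
# words like 'and', 'or', 'the'. max_features caps the vocabulary at 10,000.
vectorizer = CountVectorizer(binary=True, stop_words="english", max_features=10000)
X = vectorizer.fit_transform(docs)   # sparse matrix: X[i, j] = 1 if doc i contains word j

print(vectorizer.get_feature_names_out())
print(X.toarray())
```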
To give a really good answer to the question, it would help to know what kind of classification you are interested in: by genre, author, sentiment, etc. For stylistic classification, for example, the function words are important; for classification based on content they are just noise and are usually filtered out using a stop word list. If you are interested in classification based on content, you may want to use a weighting scheme like term frequency / inverse document frequency in order to give more weight to words that are typical for a document and comparatively rare in the whole text collection. This assumes a vector space model of your texts, which is a bag-of-words representation of the text (see Wikipedia on the vector space model and tf/idf). Usually tf/idf will yield better results than a binary weighting scheme that only records whether a term occurs in a document.
This approach is so established and common that machine learning libraries like Python's scikit-learn offer convenience methods which convert the text collection into a matrix using tf/idf as a weighting scheme.
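For instance, a minimal sketch of that route (the document strings, labels, and the choice of logistic regression as classifier are placeholders, not part of the answer above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs   = ["first document text ...", "second document text ..."]  # your 10,000 texts
labels = [0, 1]                                                    # your class labels

# Convert the text collection into a tf/idf-weighted document-term matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Any standard classifier can then be trained on the resulting matrix.
clf = LogisticRegression().fit(X, labels)
```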