I have around 10,000 text documents.
How can I represent them as feature vectors so that I can use them for text classification?
Is there a tool that builds the feature-vector representation automatically?
An example of a feature vector you might be familiar with is an RGB (red-green-blue) color description. A color can be described by how much red, green, and blue it contains, so a feature vector for a color would be color = [R, G, B].
Word2Vec is widely used in NLP models. It transforms words into vectors. Word2vec is a two-layer neural network that processes text: its input is a text corpus and its output is a set of feature vectors, one per word, that represent the words in that corpus.
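If you want to try this, here is a minimal sketch using the gensim library (an assumption on my part, it isn't named above); the tiny pre-tokenized corpus and the word-vector averaging at the end are purely illustrative:

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical pre-tokenized corpus; yours would be the ~10,000 documents.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
]

# Train a small Word2Vec model; vector_size is the dimensionality of each word vector.
model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1)

# Look up the learned feature vector for a word (a 50-dimensional array).
print(model.wv["cat"])

# One crude way to get a document vector: average the vectors of its words.
doc_vec = np.mean([model.wv[w] for w in corpus[0]], axis=0)
```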
The easiest approach is to go with the bag of words model. You represent each document as an unordered collection of words.
You probably want to strip out punctuation and you may want to ignore case. You might also want to remove common words like 'and', 'or' and 'the'.
To adapt this into a feature vector you could choose (say) 10,000 representative words from your sample, and have a binary vector v[i,j] = 1 if document i contains word j, and v[i,j] = 0 otherwise.
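In practice you rarely build that matrix by hand. A sketch of one way to do it with scikit-learn's CountVectorizer (the two-document list is just a placeholder for your collection):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; in your case this would be the ~10,000 documents.
docs = [
    "The cat sat on the mat.",
    "Dogs and cats are pets.",
]

# binary=True gives the 0/1 scheme described above; the default tokenizer
# lowercases and strips punctuation, and stop_words='english' drops common
# words like 'and', 'or', 'the'. max_features caps the vocabulary at 10,000.
vectorizer = CountVectorizer(binary=True, stop_words="english", max_features=10000)
X = vectorizer.fit_transform(docs)   # sparse matrix: X[i, j] = 1 if doc i contains word j

print(vectorizer.get_feature_names_out())
print(X.toarray())
```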
To give a really good answer to the question, it would help to know what kind of classification you are interested in: by genre, author, sentiment, etc. For stylistic classification, for example, the function words are important; for classification based on content they are just noise and are usually filtered out using a stop word list. If you are interested in classification based on content, you may want to use a weighting scheme like term frequency / inverse document frequency in order to give more weight to words that are typical for a document and comparatively rare in the whole text collection. This assumes a vector space model of your texts, which is a bag-of-words representation of the text (see Wikipedia on the vector space model and tf/idf). Usually tf/idf will yield better results than a binary weighting scheme that only records whether a term occurs in a document.
This approach is so established and common that machine learning libraries like Python's scikit-learn offer convenience methods which convert the text collection into a matrix using tf/idf as a weighting scheme.
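For instance, a minimal sketch of that route (the document strings, labels, and the choice of logistic regression as classifier are placeholders, not part of the answer above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs   = ["first document text ...", "second document text ..."]  # your 10,000 texts
labels = [0, 1]                                                    # your class labels

# Convert the text collection into a tf/idf-weighted document-term matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Any standard classifier can then be trained on the resulting matrix.
clf = LogisticRegression().fit(X, labels)
```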