I'm trying to classify texts using Weka's SVM. So far, my feature vectors for training the SVM are composed of TF-IDF statistics for the unigrams and bigrams that appear in the training texts, but the results I get when testing the trained model have not been accurate at all. Can someone give me feedback on the procedure I am following to classify these texts?
Also, could it be that I need to train the SVM with more features? If so, what features are most effective in this case? Any help would be greatly appreciated, thanks.
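For reference, here is a rough sketch of the kind of pipeline I have in mind, using Weka's StringToWordVector with an NGramTokenizer and the SMO classifier; the directory name, vocabulary cap, and other parameter values below are just placeholders, not my exact settings:

```java
import java.io.File;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TfIdfSvmSketch {
    public static void main(String[] args) throws Exception {
        // Load raw texts; each sub-directory of "corpus/" is treated as one class.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("corpus"));      // placeholder path
        Instances raw = loader.getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);   // class attribute created by the loader

        // Unigram + bigram tokenizer.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(2);

        // Convert the string attribute into TF-IDF weighted features.
        StringToWordVector tfidf = new StringToWordVector();
        tfidf.setTokenizer(tokenizer);
        tfidf.setLowerCaseTokens(true);
        tfidf.setTFTransform(true);
        tfidf.setIDFTransform(true);
        tfidf.setWordsToKeep(5000);                   // placeholder vocabulary cap
        tfidf.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, tfidf);

        // Train and cross-validate an SVM (SMO uses a linear kernel by default).
        SMO svm = new SMO();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```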
Some of the common features we can extract from a sentence are the number of words, number of capitalized words, number of punctuation marks, number of unique words, number of stop words, average sentence length, etc. We can define these features based on the data set we are working with, as in the sketch below.
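As an illustration, a plain-Java sketch of such surface features might look like the following; the stop-word list is deliberately tiny and only serves as an example:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SurfaceFeatures {
    // Tiny illustrative stop-word list; a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "are", "of", "to", "and", "in"));

    /** Returns {words, capitalized words, punctuation chars, unique words, stop words, avg sentence length}. */
    public static double[] extract(String text) {
        String[] tokens = text.trim().split("\\s+");
        int sentences = Math.max(1, text.split("[.!?]+").length);

        int capital = 0, punct = 0, stops = 0;
        Set<String> unique = new HashSet<>();
        for (String t : tokens) {
            if (!t.isEmpty() && Character.isUpperCase(t.charAt(0))) capital++;
            String word = t.replaceAll("\\p{Punct}", "");
            punct += t.length() - word.length();   // characters removed were punctuation
            word = word.toLowerCase();
            if (!word.isEmpty()) {
                unique.add(word);
                if (STOP_WORDS.contains(word)) stops++;
            }
        }
        return new double[] {
            tokens.length, capital, punct, unique.size(), stops,
            (double) tokens.length / sentences
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("The cat sat on the mat. It was happy!")));
    }
}
```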
Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups. By using Natural Language Processing (NLP), text classifiers can automatically analyze text and assign a set of pre-defined tags or categories based on its content.
Feature selection methods can be classified into four categories: filter, wrapper, embedded, and hybrid methods. Filter methods perform a statistical analysis over the feature space to select a discriminative subset of features.
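For example, in Weka a simple filter method can be applied with InfoGainAttributeEval and a Ranker; the number of attributes to keep is whatever your experiments suggest, the value below is arbitrary:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class FilterSelection {
    /** Keeps the k attributes with the highest information gain with respect to the class. */
    public static Instances selectTopK(Instances data, int k) throws Exception {
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // filter criterion: information gain
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(k);                           // keep only the k best-scoring features
        selector.setSearch(ranker);
        selector.setInputFormat(data);                      // data must already have its class index set
        return Filter.useFilter(data, selector);
    }
}
```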
1. Count of the words
2. Identifying stop words
3. Predicting parts of speech
Natural language documents normally contain many words that appear only once, known as hapax legomena. For example, 44% of the distinct words in Moby-Dick appear only once, and 17% appear only twice.
Including all words from a corpus therefore results in an excessive number of features. To reduce the size of this feature space, NLP systems typically employ one or more techniques such as stemming, stop-word removal, and frequency-based pruning of rare terms.
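For instance, if you build your vectors with Weka's StringToWordVector, setting a minimum term frequency is one way to drop hapax legomena; the threshold values below are only examples:

```java
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PruneRareTerms {
    /** Configures the bag-of-words filter so that very rare terms are dropped. */
    public static StringToWordVector configured() {
        StringToWordVector filter = new StringToWordVector();
        filter.setMinTermFreq(3);        // drops hapax legomena and other very rare terms (example threshold)
        filter.setLowerCaseTokens(true); // merge case variants before counting
        filter.setWordsToKeep(5000);     // additionally cap the vocabulary (example value)
        return filter;
    }
}
```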
For stemming, removing stop words, indexing the corpus, and computing TF-IDF or document similarity, I would recommend using Lucene. Google "Lucene in 5 minutes" for some quick and easy tutorials on using Lucene.
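As a small example, recent versions of Lucene (5.x and later; older versions also need a Version argument) let you tokenize, lower-case, remove English stop words, and stem in one pass with the EnglishAnalyzer:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneTokens {
    /** Tokenizes, lower-cases, removes English stop words, and stems the input text. */
    public static List<String> analyze(String text) throws Exception {
        List<String> terms = new ArrayList<>();
        try (EnglishAnalyzer analyzer = new EnglishAnalyzer()) {
            TokenStream ts = analyzer.tokenStream("body", text);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
            ts.close();
        }
        return terms;
    }

    public static void main(String[] args) throws Exception {
        // Prints roughly: [cat, run, quickli] (stop words removed, terms stemmed)
        System.out.println(analyze("The cats are running quickly."));
    }
}
```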
In this type of classification it is important that your vector is not very large, because you can end up with a lot of zeros in it, which can hurt the results: such vectors lie too close to each other and are hard to separate correctly. I would also recommend not using every bigram; choose only those with the highest frequency in your texts to reduce the size of your vector while keeping enough information. Here is an article on why this is recommended: http://en.wikipedia.org/wiki/Curse_of_dimensionality. Last but not least, it matters how much data you have: the bigger your vector is, the more data you need.
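One simple way to do that is to count bigram frequencies over your tokenized training texts and keep only the top N, along these lines (class and method names are just for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopBigrams {
    /** Counts bigram frequencies over tokenized texts and returns the n most frequent ones. */
    public static List<String> topN(List<String[]> tokenizedTexts, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] tokens : tokenizedTexts) {
            for (int i = 0; i + 1 < tokens.length; i++) {
                String bigram = tokens[i] + " " + tokens[i + 1];
                counts.merge(bigram, 1, Integer::sum);   // increment the count for this bigram
            }
        }
        List<String> result = new ArrayList<>();
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(n)
              .forEach(e -> result.add(e.getKey()));
        return result;
    }
}
```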