I'm trying to classify texts using Weka's SVM. So far, my feature vectors for training the SVM are composed of TF-IDF statistics for the unigrams and bigrams that appear in the training texts, but the results I get when testing the trained model have not been accurate at all. Can someone give me feedback on the procedure I am following to classify these texts?
Also, could it be that I need to train the SVM with more features? If so, what features are most effective in this case? Any help would be greatly appreciated, thanks.
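For reference, here is a rough sketch of the kind of pipeline I have in mind, using Weka's StringToWordVector with an NGramTokenizer and the SMO classifier; the directory name, vocabulary cap, and other parameter values below are just placeholders, not my exact settings:

```java
import java.io.File;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TfIdfSvmSketch {
    public static void main(String[] args) throws Exception {
        // Load raw texts; each sub-directory of "corpus/" is treated as one class.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("corpus"));      // placeholder path
        Instances raw = loader.getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);   // class attribute created by the loader

        // Unigram + bigram tokenizer.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(2);

        // Convert the string attribute into TF-IDF weighted features.
        StringToWordVector tfidf = new StringToWordVector();
        tfidf.setTokenizer(tokenizer);
        tfidf.setLowerCaseTokens(true);
        tfidf.setTFTransform(true);
        tfidf.setIDFTransform(true);
        tfidf.setWordsToKeep(5000);                   // placeholder vocabulary cap
        tfidf.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, tfidf);

        // Train and cross-validate an SVM (SMO uses a linear kernel by default).
        SMO svm = new SMO();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```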
Some of the common features we can extract from a sentence are the number of words, number of capitalized words, number of punctuation marks, number of unique words, number of stop words, average sentence length, etc. We can define these features based on the data set we are working with, as in the sketch below.
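As an illustration, a plain-Java sketch of such surface features might look like the following; the stop-word list is deliberately tiny and only serves as an example:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SurfaceFeatures {
    // Tiny illustrative stop-word list; a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "are", "of", "to", "and", "in"));

    /** Returns {words, capitalized words, punctuation chars, unique words, stop words, avg sentence length}. */
    public static double[] extract(String text) {
        String[] tokens = text.trim().split("\\s+");
        int sentences = Math.max(1, text.split("[.!?]+").length);

        int capital = 0, punct = 0, stops = 0;
        Set<String> unique = new HashSet<>();
        for (String t : tokens) {
            if (!t.isEmpty() && Character.isUpperCase(t.charAt(0))) capital++;
            String word = t.replaceAll("\\p{Punct}", "");
            punct += t.length() - word.length();   // characters removed were punctuation
            word = word.toLowerCase();
            if (!word.isEmpty()) {
                unique.add(word);
                if (STOP_WORDS.contains(word)) stops++;
            }
        }
        return new double[] {
            tokens.length, capital, punct, unique.size(), stops,
            (double) tokens.length / sentences
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("The cat sat on the mat. It was happy!")));
    }
}
```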
Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups. By using Natural Language Processing (NLP), text classifiers can automatically analyze text and assign a set of pre-defined tags or categories based on its content.
Feature selection methods can be classified into four categories: filter, wrapper, embedded, and hybrid methods. Filter methods perform a statistical analysis over the feature space to select a discriminative subset of features.
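For example, in Weka a simple filter method can be applied with InfoGainAttributeEval and a Ranker; the number of attributes to keep is whatever your experiments suggest, the value below is arbitrary:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class FilterSelection {
    /** Keeps the k attributes with the highest information gain with respect to the class. */
    public static Instances selectTopK(Instances data, int k) throws Exception {
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // filter criterion: information gain
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(k);                           // keep only the k best-scoring features
        selector.setSearch(ranker);
        selector.setInputFormat(data);                      // data must already have its class index set
        return Filter.useFilter(data, selector);
    }
}
```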
1. Count of the words
2. Identifying stop words
3. Predicting parts of speech
Natural language documents normally contain many words that appear only once, known as hapax legomena. For example, 44% of the distinct words in Moby-Dick appear only once, and 17% appear only twice.
Including all words from a corpus therefore results in an excessive number of features. To reduce the size of this feature space, NLP systems typically employ one or more techniques such as stemming, stop-word removal, and frequency-based pruning of rare terms.
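For instance, if you build your vectors with Weka's StringToWordVector, setting a minimum term frequency is one way to drop hapax legomena; the threshold values below are only examples:

```java
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PruneRareTerms {
    /** Configures the bag-of-words filter so that very rare terms are dropped. */
    public static StringToWordVector configured() {
        StringToWordVector filter = new StringToWordVector();
        filter.setMinTermFreq(3);        // drops hapax legomena and other very rare terms (example threshold)
        filter.setLowerCaseTokens(true); // merge case variants before counting
        filter.setWordsToKeep(5000);     // additionally cap the vocabulary (example value)
        return filter;
    }
}
```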
For stemming, removing stop words, indexing the corpus, and computing TF-IDF or document similarity, I would recommend using Lucene. Google "Lucene in 5 minutes" for some quick and easy tutorials on using Lucene.
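As a small example, recent versions of Lucene (5.x and later; older versions also need a Version argument) let you tokenize, lower-case, remove English stop words, and stem in one pass with the EnglishAnalyzer:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneTokens {
    /** Tokenizes, lower-cases, removes English stop words, and stems the input text. */
    public static List<String> analyze(String text) throws Exception {
        List<String> terms = new ArrayList<>();
        try (EnglishAnalyzer analyzer = new EnglishAnalyzer()) {
            TokenStream ts = analyzer.tokenStream("body", text);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
            ts.close();
        }
        return terms;
    }

    public static void main(String[] args) throws Exception {
        // Prints roughly: [cat, run, quickli] (stop words removed, terms stemmed)
        System.out.println(analyze("The cats are running quickly."));
    }
}
```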
In this type of classification it is important that your vector is not very large, because you can end up with a lot of zeros in it, which can hurt the results: such vectors lie too close to each other and are hard to separate correctly. I would also recommend not using every bigram; choose only those with the highest frequency in your texts to reduce the size of your vector while keeping enough information. Here is an article on why this is recommended: http://en.wikipedia.org/wiki/Curse_of_dimensionality. Last but not least, it matters how much data you have: the bigger your vector is, the more data you need.
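One simple way to do that is to count bigram frequencies over your tokenized training texts and keep only the top N, along these lines (class and method names are just for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopBigrams {
    /** Counts bigram frequencies over tokenized texts and returns the n most frequent ones. */
    public static List<String> topN(List<String[]> tokenizedTexts, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] tokens : tokenizedTexts) {
            for (int i = 0; i + 1 < tokens.length; i++) {
                String bigram = tokens[i] + " " + tokens[i + 1];
                counts.merge(bigram, 1, Integer::sum);   // increment the count for this bigram
            }
        }
        List<String> result = new ArrayList<>();
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(n)
              .forEach(e -> result.add(e.getKey()));
        return result;
    }
}
```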