
Natural Language Processing - Features for Text Classification

So I'm trying to classify texts using Weka SVM. So far, my feature vectors used for training the SVM are composed of TF-IDF statistics for unigrams and bigrams that appear in the training texts. But, the results I get from testing the trained SVM model haven't been accurate at all, so can someone give me feedback on my procedure? I am following these steps to classify texts:

  1. Construct a dictionary made up of extracted unigrams and bigrams from the training texts
  2. Count how many times each unigram/bigram appears in each training text, as well as how many training texts the unigram/bigram appears in
  3. Use the data from step 2 to calculate the TF-IDF for each unigram/bigram (see the sketch after this list)
  4. For each document, construct a feature vector that is the length of the dictionary, and store the corresponding TF-IDF statistic in each element of the vector (so for example, the first element in the feature vector for document one would correspond to the TF-IDF for the first word in the dictionary relative to document one)
  5. Append the class label to each feature vector to distinguish which text belongs to which author
  6. Train SVM using these feature vectors
  7. Feature vectors for the testing texts are constructed in the same way as the training texts, and are classified by the SVM
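For concreteness, steps 2-4 boil down to something like this (a simplified Java sketch, not my exact code; each document is assumed to already be the list of unigram/bigram terms extracted from it):

    import java.util.*;

    public class TfIdfSketch {
        // docs: each document represented as the list of unigram/bigram terms extracted from it
        public static List<Map<String, Double>> tfIdfVectors(List<List<String>> docs) {
            // Step 2: document frequency -- how many documents each term appears in
            Map<String, Integer> df = new HashMap<>();
            for (List<String> doc : docs)
                for (String term : new HashSet<>(doc))
                    df.merge(term, 1, Integer::sum);

            int n = docs.size();
            List<Map<String, Double>> vectors = new ArrayList<>();
            for (List<String> doc : docs) {
                // Step 2: term frequency within this document
                Map<String, Integer> tf = new HashMap<>();
                for (String term : doc)
                    tf.merge(term, 1, Integer::sum);

                // Step 3: tf-idf = tf * log(N / df); step 4 stores these values in a vector
                Map<String, Double> vec = new HashMap<>();
                for (Map.Entry<String, Integer> e : tf.entrySet())
                    vec.put(e.getKey(), e.getValue() * Math.log((double) n / df.get(e.getKey())));
                vectors.add(vec);
            }
            return vectors;
        }
    }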

Also, could it be that I need to train the SVM with more features? If so, what features are most effective in this case? Any help would be greatly appreciated, thanks.

asked Jun 07 '13 by myrocks2


2 Answers

Natural language documents normally contain many words that appear only once, known as hapax legomena. For example, 44% of the distinct words in Moby-Dick appear only once, and 17% appear only twice.

Therefore, including all words from a corpus normally results in an excessive number of features. To reduce the size of this feature space, NLP systems typically employ one or more of the following:

  • Removal of Stop Words -- for author classification, these are typically short and common words such as is, the, at, which, and so on.
  • Stemming -- popular stemmers (such as the Porter stemmer) use a set of rules to normalize the inflection of a word. E.g., walk, walking and walks are all mapped to the stem walk.
  • Correlation/Significance Threshold -- compute the Pearson correlation coefficient or the p-value of each feature with respect to the class label, then set a threshold and remove all features that score below it.
  • Coverage Threshold -- similar to the above threshold, remove all features that do not appear in at least t documents, where t is very small (< 0.05%) with respect to the entire corpus size (see the sketch after this list).
  • Filtering based on the part of speech -- for example, only considering verbs, or removing nouns.
  • Filtering based on the type of system -- for example, an NLP system for clinical text may only consider words that are found in a medical dictionary.
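
As a rough illustration of the stop-word and coverage-threshold points above, here is a Java sketch (the stop-word list is a tiny sample, and minDocs stands in for the threshold t):

    import java.util.*;

    public class FeatureFilter {
        // A tiny sample stop-word list; a real system would use a full list
        static final Set<String> STOP_WORDS =
                Set.of("is", "the", "at", "which", "on", "a", "an", "of", "and");

        // Keep only terms that are not stop words and appear in at least minDocs documents
        public static Set<String> selectFeatures(List<Set<String>> docTerms, int minDocs) {
            Map<String, Integer> df = new HashMap<>();
            for (Set<String> doc : docTerms)
                for (String term : doc)
                    if (!STOP_WORDS.contains(term))
                        df.merge(term, 1, Integer::sum);

            Set<String> kept = new HashSet<>();
            for (Map.Entry<String, Integer> e : df.entrySet())
                if (e.getValue() >= minDocs)
                    kept.add(e.getKey());
            return kept;
        }
    }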

For stemming, removing stop words, indexing the corpus, and computing TF-IDF or document similarity, I would recommend using Lucene. Google "Lucene in 5 minutes" for some quick and easy tutorials on using Lucene.
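For example, a minimal sketch of getting stemmed, stop-word-filtered tokens out of Lucene (this assumes a reasonably recent Lucene release; EnglishAnalyzer applies English stop words and Porter-style stemming):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LuceneTokens {
        // Returns stemmed, stop-word-filtered tokens for a piece of text
        public static List<String> analyze(String text) throws IOException {
            List<String> tokens = new ArrayList<>();
            try (Analyzer analyzer = new EnglishAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken())
                    tokens.add(term.toString());
                ts.end();
            }
            return tokens;
        }
    }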

answered Oct 07 '22 by Matthew Wiley


In this type of classification it is important that your feature vector is not very large: you can get a lot of zeros in it, which has a bad impact on the results because the vectors end up too close together and are hard to separate correctly. I would also recommend not using every bigram; choose the ones with the highest frequency in your texts to reduce the size of your vector while keeping enough information. For an article on why this is recommended, see http://en.wikipedia.org/wiki/Curse_of_dimensionality. Last but not least, the amount of data matters: the bigger your vector is, the more training data you need.
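
For instance, selecting only the k most frequent bigrams could look like this (a minimal Java sketch; topK is just an illustrative name):

    import java.util.*;
    import java.util.stream.Collectors;

    public class TopBigrams {
        // Keep only the k most frequent bigrams across the whole corpus
        public static List<String> topK(List<List<String>> docBigrams, int k) {
            Map<String, Integer> counts = new HashMap<>();
            for (List<String> doc : docBigrams)
                for (String bigram : doc)
                    counts.merge(bigram, 1, Integer::sum);
            return counts.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(k)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }

Only the bigrams returned by topK would then get slots in the feature vector; everything else is dropped.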

answered Oct 07 '22 by Kazenga