
Is there any best practice to prepare features for text-based classification?

We have a lot of feedback and issue reports from customers, all in plain text. We are trying to build an automatic classifier for these documents so that future feedback and issues can be routed automatically to the correct support team. Besides the text itself, I think we should include things like the customer profile, the region the case was submitted from, etc. in the classifier. I think this could give the classifier more clues to make better predictions.

Currently, all the features selected for training are based on the text content. How can I include the meta-features mentioned above?

(BTW, I am new to this. So excuse me if this question is a trivial one.)

ADD 1

My current approach is to first do some typical pre-processing on the raw text (including title and body), such as removing stop words, POS tagging, and extracting significant words. Then I convert the title and body into a list of words and store them in a sparse format like the one below:

instance 1: word1:word1 count, word2: word2 count, ....

instance 2: wordX:wordX count, wordY:wordY count, ....

And for the other non-text features, I am planning to add them as new columns after the word columns. So a final instance will look like:

instance 1: word1:word1 count, ... , feature X:value, feature Y:value
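
The layout above could be sketched roughly as follows. This is only a minimal illustration in plain Python, not a definitive implementation: the stop-word list, the `word=` / `meta:` key prefixes, the function name `build_instance`, and the sample metadata fields (`region`, `plan`) are all assumptions made for the example, and POS tagging is left out.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "to", "and", "on", "my"}

def build_instance(title, body, meta):
    """Return one sparse instance: word counts followed by metadata entries."""
    tokens = (title + " " + body).lower().split()
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Word columns: one entry per distinct word, valued by its count.
    instance = {"word=" + w: c for w, c in counts.items()}
    # Metadata columns get their own prefix so they cannot collide with words.
    for name, value in meta.items():
        instance["meta:" + name] = value
    return instance

example = build_instance(
    "Login fails",
    "The login page fails on my phone",
    {"region": "EU", "plan": "premium"},  # assumed metadata fields
)
```

Keeping the word and metadata keys in separate namespaces means a word such as "region" in the text never overwrites the `region` metadata column.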

smwikipedia asked Feb 28 '14 05:02



1 Answer

  1. If the customer profile data is a binary value (e.g. the gender of the customer), the feature can be designed as 0/1, where 0 represents male and 1 represents female. When the feature has multiple values, like the submit region (suppose we have five regions here), we should design it as a feature vector with five dimensions, such as [0 0 1 0 0], where each dimension indicates whether the post is from that specific region. In practice this works better than a single feature with multiple values when using a classifier like logistic regression, because a single numeric code would impose an artificial ordering on the regions.

  2. You are using a feature representation called bag of words. Bag of words uses the term frequency (tf) of each word in the document, but should a word with a higher tf be more important than words with a lower tf? I think not. In practice, tf*idf shows better performance.

    idf (inverse document frequency) is a way to estimate how important a word is. Document frequency (df), the number of documents a word appears in, is a good basis for estimating how important a word is for classification: a word that appears in fewer documents discriminates better (nba would almost always appear in documents about sports). So idf, which grows as df shrinks, is positively correlated with a word's importance.
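
Both points can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the five region names, the function names `one_hot` and `tf_idf`, and the toy documents are made up for the example, and the idf used here is the plain unsmoothed variant log(N / df), which is only one of several common definitions.

```python
import math
from collections import Counter

# Hypothetical set of five submit regions.
REGIONS = ["NA", "EU", "APAC", "LATAM", "MEA"]

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector, one dimension per category."""
    return [1 if value == c else 0 for c in categories]

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: tf * idf} dict per document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word at most once per document
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weighted

# Toy corpus, made up for illustration.
docs = [
    ["nba", "game", "score"],
    ["game", "server", "crash"],
    ["server", "login", "crash", "game"],
]
weights = tf_idf(docs)
region_vector = one_hot("APAC", REGIONS)
```

Here "game" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing, while "nba" occurs in only one document and keeps a positive weight. The region vector can then be appended to the tf*idf values to form the final feature vector for a document.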

michaeltang answered Oct 12 '22 15:10
