Classifying sentences with overlapping words

I have a CSV file of comments (tweets and Facebook comments). I want to classify them into four categories:

  • Pre Sales
  • Post Sales
  • Purchased
  • Service query

The problems I'm facing are:

  1. There is a huge number of overlapping words between the categories, so Naive Bayes is failing.
  2. Since tweets are only 160 chars long, what is the best way to prevent words from one category leaking into another?
  3. How should I select features that can handle both the 160-char tweets and the somewhat longer Facebook comments?
  4. Please point me to any reference or tutorial on this topic, as I'm a newbie in this field.

Thanks


1 Answer

I wouldn't be so quick to write off Naive Bayes. It does fine in many domains where there are lots of weak clues (as in "overlapping words"), but no absolutes. It all depends on the features you pass it. I'm guessing you are blindly passing it the usual "bag of words" features, perhaps after filtering for stopwords. Well, if that's not working, try a little harder.
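To make the "lots of weak clues" point concrete, here is a minimal sketch of a bag-of-words Naive Bayes in plain Python. Everything in it is an assumption for illustration: the stopword list is a tiny stand-in, the training tweets are invented (not from your corpus), and `TinyNaiveBayes` is a hypothetical helper rather than any library's class.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "is", "it", "i", "my", "to", "and", "of", "this", "was"}

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in bag_of_words(text):
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Invented toy data, only to show the mechanics.
train = [
    ("thinking about buying the phone next week, any discount?", "Pre Sales"),
    ("which phone model should I buy?", "Pre Sales"),
    ("bought the phone yesterday, the screen stopped working", "Post Sales"),
    ("my phone battery died after two days", "Post Sales"),
]
model = TinyNaiveBayes().fit(*zip(*train))
print(model.predict("should I buy this model?"))
```

Note that "phone" occurs in both categories, so it moves both scores by nearly the same amount; the decision comes from the words that lean one way. Overlap by itself does not break the classifier, but uninformative features do.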

A good approach is to read a couple of hundred tweets and see how you know which category you are looking at. That'll tell you what kind of things you need to distill into features. But be sure to look at lots of data, and focus on the general patterns.

An example (but note that I haven't looked at your corpus): Time expressions may be good clues on whether you are pre- or post-sale, but they take some work to detect. Create some features "past expression", "future expression", etc. (in addition to bag-of-words features), and see if that helps. Of course you'll need to figure out how to detect them first, but you don't have to be perfect: You're after anything that can help the classifier make a better guess. "Past tense" would probably be a good feature to try, too.
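A rough sketch of what such features might look like, using only regular expressions. The cue lists and the feature names (`has_past_expression`, `has_future_expression`, `has_past_tense`) are my own guesses, not something derived from your data, and the past-tense check is deliberately crude (it will also fire on words like "speed"). Treat it as a starting point to refine after reading real tweets.

```python
import re

# Hand-picked cue lists, purely illustrative; extend them after reading real data.
PAST_EXPRESSIONS = re.compile(
    r"\b(yesterday|ago|last\s+(?:night|week|month|year))\b", re.IGNORECASE)
FUTURE_EXPRESSIONS = re.compile(
    r"\b(tomorrow|soon|next\s+(?:week|month|year)|going\s+to|planning\s+to|will)\b",
    re.IGNORECASE)
# Crude past-tense cue: a few irregular verbs, or at least three word characters followed by "ed".
PAST_TENSE = re.compile(
    r"\b(bought|got|paid|went|came|was|were|had|did|\w{3,}ed)\b", re.IGNORECASE)

def extract_features(text):
    """Bag-of-words features plus three hand-made time-related features."""
    features = {f"word({w})": True for w in re.findall(r"[a-z']+", text.lower())}
    features["has_past_expression"] = bool(PAST_EXPRESSIONS.search(text))
    features["has_future_expression"] = bool(FUTURE_EXPRESSIONS.search(text))
    features["has_past_tense"] = bool(PAST_TENSE.search(text))
    return features

post = extract_features("Bought it last week and it stopped working")
pre = extract_features("I will buy one next month")
print(post["has_past_expression"], post["has_past_tense"], post["has_future_expression"])
print(pre["has_past_expression"], pre["has_past_tense"], pre["has_future_expression"])
```

The function returns a plain dict of feature name to value, which is the shape NLTK's `NaiveBayesClassifier.train` accepts when given a list of `(features, label)` pairs, so the same extractor could be plugged in there. Whether these particular features help is something only an experiment on your corpus can tell.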

alexis answered Feb 28 '26 21:02