 

I want a machine to learn to categorize short texts


I have a ton of short stories about 500 words long and I want to categorize them into one of, let's say, 20 categories:

  • Entertainment
  • Food
  • Music
  • etc

I can hand-classify a bunch of them, but I want to use machine learning to eventually guess the categories. What's the best way to approach this? Is there a standard machine learning approach I should be using? I don't think a decision tree would work well, since it's text data... I'm completely new to this field.

Any help would be appreciated, thanks!

asked Apr 23 '10 by atp

People also ask

How do you classify text into categories?

Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These rules instruct the system to use semantically relevant elements of a text to identify relevant categories based on its content. Each rule consists of an antecedent or pattern and a predicted category.

What type of machine learning can use for text classification problems?

The support vector machine (SVM) is a widely used text classification method. It is a machine learning method based on statistical learning theory, and it was first proposed for binary classification problems.

Which machine learning algorithm is best for text classification?

Linear Support Vector Machine is widely regarded as one of the best text classification algorithms.

How is machine learning used in text classification?

Text classification is a machine learning technique that automatically assigns tags or categories to text. Using natural language processing (NLP), text classifiers can analyze and sort text by sentiment, topic, and customer intent – faster and more accurately than humans.


1 Answer

A naive Bayes classifier will most likely work for you. The method works like this:

  • Fix a number of categories and get a training data set of (document, category) pairs.
  • The data vector for each document will be something like a bag of words. E.g. take the 100 most common words, excluding stop words such as "the" and "and". Each word gets a fixed position in your data vector (e.g. "food" is position 5). A feature vector is then an array of booleans, each indicating whether that word appears in the corresponding document.
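The bag-of-words setup above can be sketched in Python; the stop-word list and vocabulary size here are illustrative choices, not part of the answer:

```python
from collections import Counter

# Illustrative stop-word list; a real one would be larger
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "it"}

def build_vocabulary(documents, size=100):
    """Pick the `size` most common words across all documents, minus stop words."""
    counts = Counter()
    for doc in documents:
        counts.update(w for w in doc.lower().split() if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(document, vocabulary):
    """Boolean vector: entry i is True iff vocabulary word i appears in the document."""
    words = set(document.lower().split())
    return [word in words for word in vocabulary]
```

Each document, regardless of length, becomes a fixed-length boolean vector that the training step below can consume.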

Training:

  • For your training set, calculate the prior probability of every class: p(C) = number of documents of class C / total number of documents.
  • Calculate the probability of each feature within each class: p(F|C) = number of documents in class C containing the feature (e.g. the word "food" appears in the text) / number of documents in class C.
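The two training formulas above translate directly into code. As a sketch (the add-one Laplace smoothing is an addition beyond the answer's formulas, to avoid zero probabilities for unseen word/class pairs):

```python
from collections import defaultdict

def train_naive_bayes(labeled_vectors):
    """labeled_vectors: list of (feature_vector, class_label) pairs.
    Returns (p_class, p_feature_given_class)."""
    total = len(labeled_vectors)
    class_counts = defaultdict(int)                          # documents per class
    feature_counts = defaultdict(lambda: defaultdict(int))   # per class, per feature index
    for vector, label in labeled_vectors:
        class_counts[label] += 1
        for i, present in enumerate(vector):
            if present:
                feature_counts[label][i] += 1
    # p(C) = documents of class C / total documents
    p_class = {c: n / total for c, n in class_counts.items()}
    # p(F|C) with add-one (Laplace) smoothing -- not in the original formulas
    n_features = len(labeled_vectors[0][0])
    p_feature = {
        c: [(feature_counts[c][i] + 1) / (class_counts[c] + 2) for i in range(n_features)]
        for c in class_counts
    }
    return p_class, p_feature
```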

Decision:

  • Given an unclassified document, the probability that it belongs to class C is proportional to P(C) * P(F1|C) * P(F2|C) * ... * P(F500|C), assuming the features are conditionally independent given the class (the "naive" assumption). Pick the C that maximizes this product.
  • Since multiplying many small probabilities is numerically difficult, you can use the sum of the logs instead, which is maximized by the same C: log P(C) + log P(F1|C) + log P(F2|C) + ... + log P(F500|C).
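The decision rule above can be sketched as follows. Since the features are booleans, this uses the Bernoulli variant, scoring 1 - P(F|C) when a word is absent; that detail is an assumption beyond the answer's formula, which only shows the present-feature terms:

```python
import math

def classify(vector, p_class, p_feature):
    """Return the class maximizing log P(C) + sum over features of
    log P(Fi|C) (or log(1 - P(Fi|C)) when the word is absent)."""
    best_class, best_score = None, float("-inf")
    for c, prior in p_class.items():
        score = math.log(prior)
        for i, present in enumerate(vector):
            p = p_feature[c][i]
            score += math.log(p if present else 1 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Working in log space keeps the score well-behaved even with 500 features, where the raw product would underflow to zero.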
answered Sep 19 '22 by bayer