 

I want a machine to learn to categorize short texts


I have a ton of short stories about 500 words long and I want to categorize them into one of, let's say, 20 categories:

  • Entertainment
  • Food
  • Music
  • etc

I can hand-classify a bunch of them, but I want to use machine learning to eventually guess the categories. What's the best way to approach this? Is there a standard machine learning approach I should be using? I don't think a decision tree would work well, since it's text data... I'm completely new to this field.

Any help would be appreciated, thanks!

asked Apr 23 '10 by atp

People also ask

How do you classify text into categories?

Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These rules instruct the system to use semantically relevant elements of a text to identify relevant categories based on its content. Each rule consists of an antecedent or pattern and a predicted category.

What type of machine learning can use for text classification problems?

The support vector machine (SVM) is a widely used text classification method. It is a machine learning method based on statistical learning theory, and it was first proposed for binary classification problems.

Which machine learning algorithm is best for text classification?

Linear Support Vector Machine is widely regarded as one of the best text classification algorithms.

How is machine learning used in text classification?

Text classification is a machine learning technique that automatically assigns tags or categories to text. Using natural language processing (NLP), text classifiers can analyze and sort text by sentiment, topic, and customer intent – faster and more accurately than humans.


1 Answer

A naive Bayes classifier will most likely work for you. The method works like this:

  • Fix a number of categories and get a training data set of (document, category) pairs.
  • The data vector for each document will be something like a bag of words. E.g. take the 100 most common words, excluding stop words such as "the" and "and". Each word gets a fixed position in your data vector (e.g. "food" is position 5). A feature vector is then an array of booleans, each indicating whether that word appears in the corresponding document.
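The bag-of-words setup above can be sketched in Python; the stop-word list and vocabulary size here are illustrative choices, not part of the answer:

```python
from collections import Counter

# Illustrative stop-word list; a real one would be larger
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "it"}

def build_vocabulary(documents, size=100):
    """Pick the `size` most common words across all documents, minus stop words."""
    counts = Counter()
    for doc in documents:
        counts.update(w for w in doc.lower().split() if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(document, vocabulary):
    """Boolean vector: entry i is True iff vocabulary word i appears in the document."""
    words = set(document.lower().split())
    return [word in words for word in vocabulary]
```

Each document, regardless of length, becomes a fixed-length boolean vector that the training step below can consume.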

Training:

  • For your training set, calculate the prior probability of every class: p(C) = number of documents of class C / total number of documents.
  • Calculate the probability of each feature within each class: p(F|C) = number of documents in class C containing the feature (e.g. the word "food" appears in the text) / number of documents in class C.
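The two training formulas above translate directly into code. As a sketch (the add-one Laplace smoothing is an addition beyond the answer's formulas, to avoid zero probabilities for unseen word/class pairs):

```python
from collections import defaultdict

def train_naive_bayes(labeled_vectors):
    """labeled_vectors: list of (feature_vector, class_label) pairs.
    Returns (p_class, p_feature_given_class)."""
    total = len(labeled_vectors)
    class_counts = defaultdict(int)                          # documents per class
    feature_counts = defaultdict(lambda: defaultdict(int))   # per class, per feature index
    for vector, label in labeled_vectors:
        class_counts[label] += 1
        for i, present in enumerate(vector):
            if present:
                feature_counts[label][i] += 1
    # p(C) = documents of class C / total documents
    p_class = {c: n / total for c, n in class_counts.items()}
    # p(F|C) with add-one (Laplace) smoothing -- not in the original formulas
    n_features = len(labeled_vectors[0][0])
    p_feature = {
        c: [(feature_counts[c][i] + 1) / (class_counts[c] + 2) for i in range(n_features)]
        for c in class_counts
    }
    return p_class, p_feature
```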

Decision:

  • Given an unclassified document, the probability that it belongs to class C is proportional to P(C) * P(F1|C) * P(F2|C) * ... * P(F500|C), assuming the features are conditionally independent given the class (the "naive" assumption). Pick the C that maximizes this product.
  • Since multiplying many small probabilities is numerically difficult, you can use the sum of the logs instead, which is maximized by the same C: log P(C) + log P(F1|C) + log P(F2|C) + ... + log P(F500|C).
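The decision rule above can be sketched as follows. Since the features are booleans, this uses the Bernoulli variant, scoring 1 - P(F|C) when a word is absent; that detail is an assumption beyond the answer's formula, which only shows the present-feature terms:

```python
import math

def classify(vector, p_class, p_feature):
    """Return the class maximizing log P(C) + sum over features of
    log P(Fi|C) (or log(1 - P(Fi|C)) when the word is absent)."""
    best_class, best_score = None, float("-inf")
    for c, prior in p_class.items():
        score = math.log(prior)
        for i, present in enumerate(vector):
            p = p_feature[c][i]
            score += math.log(p if present else 1 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Working in log space keeps the score well-behaved even with 500 features, where the raw product would underflow to zero.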
answered Sep 19 '22 by bayer