
Very simple text classification by machine learning? [duplicate]

Tags: python, algorithm, machine-learning, text-analysis

Possible Duplicate:
Text Classification into Categories

I am currently working on a solution to get the type of food served in a database with 10k restaurants based on their description. I'm using lists of keywords to decide which kind of food is being served.

I read a little bit about machine learning but I have no practical experience with it at all. Can anyone explain to me if/why it would be a better solution to a simple problem like this? I find accuracy more important than performance!

Simplified example:

["China", "Chinese", "Rice", "Noodles", "Soybeans"]
["Belgium", "Belgian", "Fries", "Waffles", "Waterzooi"]

A possible description could be:

"Hong's Garden Restaurant offers savory, reasonably priced Chinese to our customers. If you find that you have a sudden craving for rice, noodles or soybeans at 8 o’clock on a Saturday evening, don’t worry! We’re open seven days a week and offer carryout service. You can get fries here as well!"


asked Dec 09 '12 14:12

Dieter

1 Answer

You are indeed describing a classification problem, which can be solved with machine learning.

In this problem, your features are the words in the description. You should use the Bag of Words model, which basically says that the words and the number of times each one occurs are what matter to the classification process; word order is ignored.
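As a rough illustration of the Bag of Words idea, here is a minimal sketch using only the Python standard library. The function name `bag_of_words`, the tokenizing regular expression, and the sample description are assumptions made for this example, not part of any particular library.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # Runs of letters (and apostrophes) are tokens; everything else separates words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Invented description, loosely modeled on the one in the question.
description = (
    "Hong's Garden Restaurant offers savory, reasonably priced Chinese "
    "to our customers. You can get rice, noodles and more rice here!"
)
features = bag_of_words(description)

print(features["rice"])     # 2
print(features["chinese"])  # 1
print(features["waffles"])  # 0 (a Counter returns 0 for words it never saw)
```

Only the counts survive: the classifier never sees that "rice" came before "noodles".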

To solve your problem, here are the steps you should follow:

1. Create a feature extractor that, given a description of a restaurant, returns the "features" (under the Bag of Words model explained above) of this restaurant (each restaurant is called an "example" in the literature).
2. Manually label a set of examples. Each one is labeled with the desired class (Chinese, Belgian, Junk food, ...).
3. Feed your labeled examples into a learning algorithm. It will generate a classifier. From personal experience, SVM usually gives the best results, but there are other choices such as Naive Bayes, neural networks, and decision trees (usually C4.5 is used); each has its own advantages.
4. When a new (unlabeled) example (restaurant) comes in, extract its features and feed them to your classifier. It will tell you which class it thinks the restaurant belongs to (and usually the probability that the classifier is correct).
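The four steps above can be sketched end to end in plain Python. This sketch assumes multinomial Naive Bayes rather than SVM, only because Naive Bayes fits in a few lines of standard-library code. The function names and the four training sentences are invented for illustration, and a real training set would need far more labeled examples.

```python
import math
import re
from collections import Counter, defaultdict

def extract_features(description):
    """Step 1: bag-of-words features (word -> count)."""
    return Counter(re.findall(r"[a-z']+", description.lower()))

def train_naive_bayes(labeled_examples):
    """Step 3: build a multinomial Naive Bayes model from (description, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of examples
    vocabulary = set()
    for description, label in labeled_examples:
        features = extract_features(description)
        word_counts[label].update(features)
        label_counts[label] += 1
        vocabulary.update(features)
    return word_counts, label_counts, vocabulary

def classify(model, description):
    """Step 4: return (best_label, estimated_probability) for a new description."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    features = extract_features(description)
    log_scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_examples)  # class prior
        for word, count in features.items():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += count * math.log(p)
        log_scores[label] = score
    # Turn log scores into probabilities that sum to 1.
    highest = max(log_scores.values())
    weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

# Step 2: manually labeled examples (invented for this sketch).
training_set = [
    ("Chinese noodles and fried rice", "Chinese"),
    ("Rice, soybeans and dumplings from China", "Chinese"),
    ("Belgian waffles and fries", "Belgian"),
    ("Fries with waterzooi, a taste of Belgium", "Belgian"),
]
model = train_naive_bayes(training_set)
label, probability = classify(model, "Savory Chinese rice and noodles, fries as well")
print(label, round(probability, 3))
```

Unlike a hard keyword match, the single mention of "fries" only lowers the confidence for Chinese instead of flipping the decision.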

Evaluation:
You can evaluate your algorithm with cross-validation, or by separating a test set out of your labeled examples that is used only for measuring how accurate the algorithm is.
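A minimal sketch of k-fold cross-validation, assuming the learner is supplied as two callables: `train(examples)` returns a model and `predict(model, description)` returns a label. Those names, the majority-class toy learner, and the sample data are all hypothetical and only there to make the sketch runnable.

```python
import random
from collections import Counter

def k_fold_accuracy(labeled_examples, train, predict, k=5, seed=0):
    """Average accuracy over k folds; each fold is held out for testing exactly once."""
    examples = list(labeled_examples)
    if k < 2 or len(examples) < k:
        raise ValueError("need k >= 2 and at least k labeled examples")
    random.Random(seed).shuffle(examples)          # reproducible shuffle
    folds = [examples[i::k] for i in range(k)]     # k roughly equal folds
    accuracies = []
    for i, test_fold in enumerate(folds):
        # Train on every fold except the one being tested.
        training_part = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training_part)
        correct = sum(predict(model, text) == label for text, label in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / len(accuracies)

def train_majority(examples):
    """Toy learner: remember the most common label in the training data."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def predict_majority(model, description):
    """Toy predictor: always answer with the remembered label."""
    return model

labeled = [("noodles and rice", "Chinese")] * 6 + [("waffles and fries", "Belgian")] * 4
accuracy = k_fold_accuracy(labeled, train_majority, predict_majority, k=5)
print(f"majority-class baseline accuracy: {accuracy:.2f}")
```

A majority-class baseline like this is a useful floor: a real classifier that does not beat it has learned nothing from the descriptions.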

Optimizations:

From personal experience, here are some optimizations I found helpful for the feature extraction:

1. Stemming and eliminating stop words usually helps a lot.
2. Using bigrams tends to improve accuracy (though it increases the feature space significantly).
3. Some classifiers cope poorly with a large feature space (SVM is not one of them). There are ways to overcome this, such as reducing the dimensionality of your features. PCA is one technique that can help with that. Genetic algorithms are also (empirically) pretty good for subset selection.
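A small sketch of two of these optimizations, stop-word removal and bigrams, in standard-library Python. The stop-word list here is a tiny invented sample, not a standard list, and stemming is left out because a real stemmer (for example the Porter stemmer) does not fit in a few lines.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {
    "a", "an", "and", "the", "to", "of", "for", "or", "on", "at",
    "you", "our", "we", "is", "are", "as", "can", "get", "here", "well",
}

def extract_features(description, use_bigrams=True):
    """Bag of words without stop words, optionally extended with bigrams."""
    tokens = [
        token
        for token in re.findall(r"[a-z']+", description.lower())
        if token not in STOP_WORDS
    ]
    features = Counter(tokens)
    if use_bigrams:
        # Pair each remaining token with its successor. Because stop words are
        # removed first, a bigram can span a word that was dropped.
        features.update(f"{first} {second}" for first, second in zip(tokens, tokens[1:]))
    return features

features = extract_features("You can get fried rice and Belgian fries here")
print(sorted(features))
```

The bigram "belgian fries" now carries information that the single word "fries" does not, at the cost of many more distinct features.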

Libraries:

Unfortunately, I am not fluent enough in Python, but here are some libraries that might be helpful:

- Lucene might help you a lot with the text analysis; for example, stemming can be done with EnglishAnalyzer. There is a Python version of Lucene called PyLucene, which I believe might help you out.
- Weka is an open source library that implements a lot of useful things for machine learning, including many classifiers and feature selectors.
- Libsvm is a library that implements the SVM algorithm.


answered Sep 19 '22 17:09

amit
