What subjects, topics does a computer science graduate need to learn to apply available machine learning frameworks, esp. SVMs

Tags:

machine-learning

classification

I want to teach myself enough machine learning to understand and put to use the available open source ML frameworks, so that I can do things like:

1. Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements, and which the metadata (neither the content nor the ads, e.g. TOC, author bio, etc.).
2. Go through the HTML source of pages from disparate sites and "classify" whether each site belongs to a predefined category or not (the list of categories will be supplied beforehand).
3. ... similar classification tasks on text and pages.

As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.

As far as my limited understanding goes, taking the neural net approach will take a lot more training and maintenance than putting SVMs to use?

I understand that SVMs are well suited to (binary) classification tasks like mine, and open source frameworks like libSVM are fairly mature?

In that case, what subjects and topics does a computer science graduate need to learn right now, so that the above requirements can be solved, putting these frameworks to use?

I would like to stay away from Java, if possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.

My intent is not to write code from scratch but, to begin with, to put the various available frameworks to use (I do not know enough to decide which, though), and I should be able to fix things should they go wrong.

Recommendations from you on learning specific portions of statistics and probability theory are nothing unexpected from my side, so say so if required!

I will modify this question if needed, depending on all your suggestions and feedback.

asked Sep 21 '10 21:09

PoorLuzer

3 Answers

"Understanding" in machine learning is the equivalent of having a model. The model can be, for example, a collection of support vectors, the layout and weights of a neural network, a decision tree, or something else. Which of these methods works best really depends on the subject you're learning from and on the quality of your training data.

In your case, learning from a collection of HTML pages, you will want to preprocess the data first; this step is also called "feature extraction". That is, you extract information out of the page you're looking at. This is a difficult step, because it requires domain knowledge and you have to extract useful information; otherwise your classifiers will not be able to make good distinctions. Feature extraction gives you a dataset (a matrix with one row per example and one column per feature) from which you'll be able to create your model.
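
To make the idea of a feature matrix concrete, here is a minimal sketch using only Python's standard library. The class `TextAndTagCounter`, the function `extract_features`, the toy pages, and the four-word vocabulary are all made up for illustration; real feature extraction for your sites would need far more domain knowledge than this.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextAndTagCounter(HTMLParser):
    """Collects visible words and counts tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.tags = Counter()
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(re.findall(r"[a-z]+", data.lower()))


def extract_features(html, vocabulary):
    """Turn one page into a fixed-length row of numbers."""
    parser = TextAndTagCounter()
    parser.feed(html)
    parser.close()
    counts = Counter(parser.words)
    row = [counts[word] for word in vocabulary]  # bag-of-words counts
    row.append(parser.tags["a"])                 # number of links
    row.append(len(parser.words))                # total visible words
    return row


# Toy input, assumed for illustration only.
pages = [
    "<html><body><h1>Cheap flights</h1>"
    "<p>Book cheap flights <a href='#'>now</a></p></body></html>",
    "<html><body><p>The match ended; the team won the match.</p>"
    "<script>var x=1;</script></body></html>",
]
vocabulary = ["cheap", "flights", "match", "team"]
matrix = [extract_features(page, vocabulary) for page in pages]
print(matrix)  # [[2, 2, 0, 0, 1, 6], [0, 0, 2, 1, 0, 8]]
```

Each row of `matrix` describes one page, and each column is one feature; this is the shape of input that tools such as libSVM or Weka expect once it is written out in their file format.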

Generally in machine learning it is advised to also keep a "test set" that you do not train your models with, but that you use at the end to decide which method is best. It is extremely important that you keep the test set hidden until the very end of your modeling step! The test data gives you an estimate of the "generalization error" that your model makes. Any model with enough complexity and training time tends to learn exactly the information that you train it with. Machine learners say that the model "overfits" the training data. Such overfitted models appear good, but this is just memorization.
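
The following sketch illustrates why the held-out test set matters. Everything in it is an assumption made for the demonstration: the data is synthetic with purely random labels, and `Memorizer` is a deliberately silly "model" that just stores its training examples. It scores perfectly on the data it has seen and does no better than guessing on the test set.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle once, then set the last part aside as the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


class Memorizer:
    """A deliberately overfitting 'model': it stores every training example."""

    def fit(self, rows):
        self.table = {features: label for features, label in rows}
        labels = [label for _, label in rows]
        # Fallback for unseen inputs: the most common training label.
        self.default = max(set(labels), key=labels.count)
        return self

    def predict(self, features):
        return self.table.get(features, self.default)


def accuracy(model, rows):
    return sum(model.predict(f) == y for f, y in rows) / len(rows)


rng = random.Random(42)
# Synthetic data whose labels are pure noise, so there is nothing to learn.
data = [((rng.random(), rng.random()), rng.choice([0, 1])) for _ in range(200)]

train, test = split_train_test(data)
model = Memorizer().fit(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy alone would suggest a perfect model; only the untouched test rows reveal that nothing general was learned.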

While software support for preprocessing data is very sparse and highly domain dependent, Weka, as adam mentioned, is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik, the inventor of SVMs, wrote "The Nature of Statistical Learning Theory". You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology here helps you find your way around.

answered Nov 15 '22 08:11

Andreas

Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
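
One possible reading of "classification on parts of HTML trees, preselected by some heuristic" is sketched below, using only Python's standard library. The names `BlockSplitter` and `candidate_blocks`, the set of block tags, and the `min_chars` threshold are assumptions chosen for illustration, not an established method. The page is cut into text blocks at block-level tags, very short blocks are dropped, and each surviving block gets a couple of simple features (such as link density) that a classifier could later use to label it as content, ad, or metadata.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "section", "article"}


def _normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


class BlockSplitter(HTMLParser):
    """Cuts a page into text blocks at block-level tags (hypothetical helper).

    For brevity, <script> and <style> contents are not filtered out here.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks, one dict each
        self._text = []       # text pieces of the block being built
        self._link_chars = 0  # characters of that text inside <a> tags
        self._in_link = 0
        self._tag = None      # block tag that opened the current block

    def flush(self):
        text = _normalize("".join(self._text))
        if text:
            self.blocks.append({
                "tag": self._tag,
                "text": text,
                "link_density": self._link_chars / len(text),
            })
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
            self._tag = tag
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
            self._tag = None  # stray text after a closing tag has no block tag
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(_normalize(data))


def candidate_blocks(html, min_chars=5):
    """Heuristic preselection: keep only blocks with enough text."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()
    return [b for b in splitter.blocks if len(b["text"]) >= min_chars]


# Toy page, assumed for illustration only.
page = """
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div><p>Support vector machines find a separating hyperplane with a large margin.</p>
<p>By <a href="/u/jane">Jane</a></p></div>
"""
blocks = candidate_blocks(page)
for block in blocks:
    print(block["tag"], round(block["link_density"], 2), block["text"][:40])
```

On this toy page the navigation block is almost entirely link text while the article paragraph contains none, which hints at why per-block features of this kind can turn step 1 into an ordinary classification problem.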

answered Nov 15 '22 08:11

Fred Foo

The most widely used general machine learning library that is freely available is probably WEKA. Its authors have a book that introduces some ML concepts and covers how to use the software. Unfortunately for you, it is written entirely in Java.

I am not really a Python person, but it would surprise me if there weren't a lot of tools available for it as well.

For text-based classification, Naive Bayes, Decision Trees (J48 in particular, I think), and SVM approaches are giving the best results right now. However, each is suited to slightly different applications. Off the top of my head, I'm not sure which would suit you best. With a tool like WEKA you could try all three approaches on some example data without writing a line of code and see for yourself.
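
To give a feel for what the simplest of these three does under the hood, here is a small sketch of a multinomial Naive Bayes text classifier with add-one smoothing, written in plain Python. It is not WEKA's implementation; the class name `NaiveBayesText` and the four training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace-separated words (a sketch)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, document):
        # Words never seen in training are ignored.
        words = [w for w in document.lower().split() if w in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # log P(class) + sum of log P(word | class), add-one smoothed
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[w] + 1) / denom) for w in words)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, assumed for illustration only.
documents = [
    "cheap flights and hotel deals",
    "book your holiday flights",
    "the team won the football match",
    "the striker scored in the match",
]
labels = ["travel", "travel", "sports", "sports"]
classifier = NaiveBayesText().fit(documents, labels)

print(classifier.predict("cheap hotel flights"))     # travel
print(classifier.predict("football match tonight"))  # sports
```

With only four training sentences this proves nothing about accuracy; it only shows that the method amounts to counting words per category and comparing smoothed log-probabilities.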

I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.

Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
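
As a rough illustration of what "not probabilistic" means here: a trained linear SVM boils down to a weight vector and a bias, and its output is a signed score whose sign gives the class, not a probability. The weights and inputs below are hypothetical values standing in for a trained model.

```python
def svm_decision(weights, bias, features):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical values, as if a linear SVM had already been trained.
weights, bias = [0.8, -0.5], -0.2

scores = [svm_decision(weights, bias, x) for x in ([2.0, 1.0], [0.1, 0.9])]
predictions = [1 if score >= 0 else -1 for score in scores]
print([round(score, 2) for score in scores], predictions)
```

The magnitude of the score says how far a point lies from the separating boundary, but it is not calibrated as a probability the way a Naive Bayes posterior is.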

From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.

answered Nov 15 '22 07:11

adam
