Web page recommender system

Tags:

machine-learning

mahout

recommendation-engine

I am trying to build a recommender system that recommends web pages to a user based on their actions (Google searches, clicks, and explicit ratings of web pages). For an idea of what I mean, consider how Google News does it: it displays news articles from the web on a particular topic. In technical terms that is clustering, but my aim is similar. It will be content-based recommendation driven by the user's actions.

So my questions are:

1. How can I crawl the internet to find related web pages?
2. What algorithm should I use to extract data from a web page? Are textual analysis and word frequency the only way to do it?
3. Lastly, what platform is best suited for this problem? I have heard of Apache Mahout, which comes with some reusable algorithms. Does it sound like a good fit?

asked Oct 08 '12 09:10

Rajan Soni

2 Answers

As Thomas Jungblut said, one could write several books on your questions ;-) I will try to give you a list of brief pointers, but be aware that there is no ready-to-use, off-the-shelf solution.

Crawling the internet: There are plenty of toolkits for doing this, like Scrapy for Python, crawler4j and Heritrix for Java, or WWW::Robot for Perl. For extracting the actual content from web pages, have a look at boilerpipe.

http://scrapy.org/

http://crawler.archive.org/

http://code.google.com/p/crawler4j/

https://metacpan.org/module/WWW::Robot

http://code.google.com/p/boilerpipe/
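
To make the crawling step concrete, here is a minimal sketch of the loop that the toolkits above implement for you: fetch a page, extract its links, and queue the ones not yet seen. It uses only the Python standard library. The names `LinkExtractor`, `crawl`, and `FAKE_WEB` are made up for illustration, and the "web" is an in-memory dictionary so the sketch runs without network access. It is not how Scrapy or Heritrix work internally.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="b.html#top">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b.html": "<p>no links here</p>",
}

pages = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

A real crawler would replace `FAKE_WEB.get` with an HTTP fetch and would also need to respect robots.txt, rate-limit its requests, and handle errors, which is a large part of why a dedicated toolkit is worth using.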

First of all, you can often use collaborative filtering instead of content-based approaches. But if you want good coverage, especially in the long tail, there is no way around analyzing the text. One thing to look at is topic modelling, e.g. LDA (latent Dirichlet allocation). Several LDA approaches are implemented in Mallet, Apache Mahout, and Vowpal Wabbit. For indexing, search, and text processing, have a look at Lucene. It is an awesome, mature piece of software.

http://mallet.cs.umass.edu/

http://mahout.apache.org/

http://hunch.net/~vw/

http://lucene.apache.org/
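
As a baseline for the "word frequency" approach the question asks about, here is a minimal sketch of content-based recommendation with TF-IDF weights and cosine similarity, in plain Python. Note that this is TF-IDF, not LDA; topic models are the next step up and are better taken from one of the toolkits above. The function names and the `PAGES` data are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Map each document id to a dict of term -> TF-IDF weight."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    n_docs = len(docs)
    vectors = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        # Smoothed IDF: a term that occurs in every document gets weight 0.
        vectors[doc_id] = {
            term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(liked_id, docs, top_n=2):
    """Rank the other documents by similarity to one the user liked."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[liked_id], vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != liked_id
    ]
    scores.sort(reverse=True)
    return [doc_id for _score, doc_id in scores[:top_n]]


# Hypothetical page texts, e.g. as produced by a content extractor.
PAGES = {
    "python-crawl": "writing a web crawler in python with scrapy",
    "java-crawl": "a web crawler in java with heritrix",
    "pasta": "how to cook pasta with tomato sauce",
}

ranked = recommend("python-crawl", PAGES)
print(ranked)
```

For a user who liked the Python crawler page, the Java crawler page ranks first because it shares the informative terms "web" and "crawler", while "with" appears in every page and so carries no weight.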

Besides Apache Mahout, which also contains things like LDA (see above), clustering, and text processing, other toolkits are available if you want to focus on collaborative filtering: LensKit, which is also implemented in Java, and MyMediaLite (disclaimer: I am the main author), which is implemented in C# but also has a Java port.

http://lenskit.grouplens.org/

http://ismll.de/mymedialite

https://github.com/jcnewell/MyMediaLiteJava

answered Oct 05 '22 23:10

zenog

This should be a good read: "Google News Personalization: Scalable Online Collaborative Filtering".

It focuses on collaborative filtering rather than content-based recommendations, but it touches on some very interesting points such as scalability, item churn, algorithms, system setup, and evaluation.

Mahout has very good collaborative filtering techniques, which match what you describe as using the behaviour of the users (clicks, reads, etc.), and you could introduce some content-based logic using the rescorer classes.

You might also want to have a look at Myrrix, which is in some ways the evolution of the Taste (aka recommendations) portion of Mahout. It also allows applying content-based logic on top of collaborative filtering using the rescorer classes.
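
To illustrate the idea of layering content-based logic on top of collaborative filtering, here is a minimal Python sketch: item-based collaborative filtering over click data, followed by an optional rescoring hook that can boost or demote items based on their content. This is not the Mahout or Myrrix API. The `rescore` callback, the `CLICKS` data, and the `TOPIC` labels are all made up for illustration.

```python
import math
from collections import defaultdict


def item_similarities(interactions):
    """Cosine similarity between items, based on which users touched them."""
    users_by_item = defaultdict(set)
    for user, items in interactions.items():
        for item in items:
            users_by_item[item].add(user)
    sims = defaultdict(dict)
    items = list(users_by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            common = len(users_by_item[a] & users_by_item[b])
            if common:
                score = common / math.sqrt(
                    len(users_by_item[a]) * len(users_by_item[b])
                )
                sims[a][b] = score
                sims[b][a] = score
    return sims


def recommend(user, interactions, rescore=None, top_n=3):
    """Score unseen items by similarity to the user's items, then rescore."""
    sims = item_similarities(interactions)
    seen = interactions[user]
    scores = defaultdict(float)
    for item in seen:
        for other, score in sims[item].items():
            if other not in seen:
                scores[other] += score
    if rescore is not None:
        # Content-based hook: adjust each collaborative score per item.
        scores = {item: rescore(item, score) for item, score in scores.items()}
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical click data: user -> set of pages they clicked.
CLICKS = {
    "ann": {"news-1", "news-2"},
    "bob": {"news-1", "news-2", "sports-1"},
    "cat": {"news-2", "sports-1", "news-3"},
    "dan": {"news-1"},
}

# Hypothetical content labels, e.g. derived from text analysis of each page.
TOPIC = {
    "news-1": "news",
    "news-2": "news",
    "news-3": "news",
    "sports-1": "sports",
}


def boost_sports(item, score):
    """Double the score of sports pages; leave everything else unchanged."""
    return score * 2.0 if TOPIC[item] == "sports" else score


plain = recommend("dan", CLICKS)
boosted = recommend("dan", CLICKS, rescore=boost_sports)
print(plain, boosted)
```

Without the hook, the purely behavioural ranking for `dan` puts `news-2` ahead of `sports-1`; with the sports boost the order flips. In Mahout the same role is played by the rescorer classes mentioned above, whose exact interface should be checked in the Mahout documentation.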

If you are interested in Mahout, the Mahout in Action book would be the best place to start.

answered Oct 06 '22 01:10

Julian Ortega
