How to extract meaningful and useful content from web pages? [closed]

I would like to parse a web page and extract the meaningful content from it. By meaningful, I mean the content (text only) that the user wants to see on that particular page, excluding ads, banners, comments, etc. I want to ensure that when a user saves a page, the data they wanted to read is saved, and nothing else.

In short, I need to build an application that works just like Readability (http://www.readability.com). I need to take the useful content of the web page and store it in a separate file. I don't really know how to go about it.

I don't want to use APIs that require me to connect to the internet and fetch data from their servers, as the data extraction needs to be done offline.

There are two methods that I could think of:

  1. Use a machine learning based algorithm (like this: http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/ )

  2. Develop a web scraper that could satisfactorily remove all clutter from web pages.

Is there an existing tool that does this? I came across the boilerpipe library (http://code.google.com/p/boilerpipe/) but haven't used it. Has anybody used it? Does it give satisfactory results? Are there any other tools, particularly ones written in PHP or Python, that do this kind of web scraping?

If I need to build my own tool to do this, how would you suggest going about it?

Since I'd need to clean up messy or incomplete HTML before I begin parsing it, I'd use a tool like Tidy (http://www.w3.org/People/Raggett/tidy/) or Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/bs4/doc/) to do the job.

But I don't know how to extract content after this step.

PS. I am an amateur and would love it if there were ready-to-use open source tools that do this and can be easily integrated into the code I'll write in PHP or Python. Or, if I have to write my own code, I'd love to get guidance from anyone who has done such work before! :) Thanks a lot!

user1271286 asked Dec 09 '12 20:12



2 Answers

Did you type 'python readability' into Google? There is a pretty popular library (200+ followers) on GitHub:

https://github.com/buriy/python-readability

Additionally, there is a PHP one, which you would find by typing 'php readability'. It has 100 followers but has not had any activity for almost two years: https://github.com/feelinglucky/php-readability

Finally, the most popular (350+ GitHub followers) is the Ruby Readability port: https://github.com/iterationlabs/ruby-readability

At the very least you can see how these 3 different projects accomplish parsing the "important parts" of a webpage.
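
To give a rough idea of the kind of heuristic involved, here is a minimal sketch that uses only the Python standard library. It is not the algorithm of any of the three projects above. It only assumes that main content tends to sit in long text blocks with few links, while menus and ads are short and link-heavy. The names `BlockCollector` and `extract_main_text`, the tag lists, and the thresholds `min_chars` and `max_link_density` are all made up for illustration and would need tuning on real pages.

```python
from html.parser import HTMLParser

# Assumption: elements whose contents are never article text.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "header", "form"}
# Assumption: elements that start and end a candidate text block.
BLOCK_TAGS = {"p", "div", "td", "article", "section", "li", "pre", "blockquote"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the link text in each one."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # finished blocks as (text, link_chars)
        self._pieces = []      # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self._link_depth = 0   # > 0 while inside an <a> element

    def finish(self):
        """Closes the block being built and stores it if it has any text."""
        text = " ".join("".join(self._pieces).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._pieces, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.finish()
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self.finish()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self.finish()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._pieces.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_chars=80, max_link_density=0.3):
    """Keeps blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.finish()
    kept = []
    for text, link_chars in parser.blocks:
        link_density = min(1.0, link_chars / len(text))
        if len(text) >= min_chars and link_density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)


sample = """
<html><body>
<div id="menu"><a href="/">Home</a> <a href="/about">About us</a> <a href="/contact">Contact</a></div>
<p>Content extraction tools try to keep the article body and drop everything else,
such as navigation bars, advertisements, and comment threads around it.</p>
<div class="ad"><a href="http://example.com/buy">Buy now</a></div>
<script>var tracking = true;</script>
</body></html>
"""

print(extract_main_text(sample))
```

On the sample page only the paragraph survives: the menu and the ad are short and consist almost entirely of link text, and the script contents are skipped. Real pages are much messier, which is why reading the source of the libraries above is still worthwhile.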

dm03514 answered Oct 12 '22 09:10


You may use htql.

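import htql

page = "..."  # HTML source of the page to process
query = "&html_main_text"  # HTQL expression given in this answer for selecting the main text

result = htql.query(page, query)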

seagulf answered Oct 12 '22 09:10