
Get only relevant portion of website

How do Evernote's Web Clipper plugin and the Announcify plugin extract only the relevant article/post/content part of a page? Here is a screenshot from the Evernote plugin:

[screenshot of the Evernote plugin]

No matter which website you visit, and however different its layout is from other sites, these plugins always manage to get the article/post/content part of the page.

Every website has a different layout: some have a sidebar, some don't. The markup for the main article/content part varies as well: some sites use the HTML5 <article> or <section> elements, others use <h1> > <p>, some use <h2> > <p>, and others use none of these. So there are many different combinations of tags as well as layouts.

Can anyone suggest a way to extract the main article/post/content using JavaScript or PHP?

asked Feb 09 '12 08:02 by Dev555



1 Answer

You can do simple DOM parsing and look for the <div>s and <p>s that contain the most text (text, not HTML code!). Whichever heuristic you choose for deciding where the content is, you will have to start from DOM parsing, so have a look at the DOM parsing libraries available for PHP.

Anyway, you can start from this:

http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/

It looks quite good, and it gives technical explanations in case you want to write something of your own.
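
To make the "find the element with the most text" idea concrete, here is a minimal JavaScript sketch. It is only one possible heuristic, not what Evernote, Announcify, or the linked article actually do. The names `findMainContent`, `textLength`, `linkTextLength` and the `SKIP_TAGS` list are made up for this illustration. It scores every element by the amount of non-link text in its direct <p> children and returns the highest-scoring one, and it only relies on the standard DOM properties `tagName`, `children` and `textContent`.

```javascript
// Hypothetical list of containers that rarely hold the article itself.
const SKIP_TAGS = new Set(["SCRIPT", "STYLE", "NAV", "HEADER", "FOOTER", "ASIDE", "FORM"]);

// Length of the text inside an element, with whitespace collapsed.
function textLength(el) {
  return (el.textContent || "").replace(/\s+/g, " ").trim().length;
}

// Length of the text that sits inside <a> descendants.
// Menus and sidebars are mostly links, so this text is subtracted from the score.
function linkTextLength(el) {
  let total = 0;
  for (const child of el.children) {
    total += child.tagName.toUpperCase() === "A" ? textLength(child) : linkTextLength(child);
  }
  return total;
}

// Walk the tree below `root` and return the element whose direct <p>
// children hold the most non-link text, or null if no element scores above 0.
function findMainContent(root) {
  let best = null;
  let bestScore = 0;
  const stack = [root];
  while (stack.length > 0) {
    const el = stack.pop();
    if (SKIP_TAGS.has(el.tagName.toUpperCase())) continue; // skip the whole subtree
    let score = 0;
    for (const child of el.children) {
      if (child.tagName.toUpperCase() === "P") {
        score += textLength(child) - linkTextLength(child);
      }
      stack.push(child);
    }
    if (score > bestScore) {
      best = el;
      bestScore = score;
    }
  }
  return best;
}
```

In a browser you would call `findMainContent(document.body)` and read `textContent` or `innerHTML` from the result. Real pages will need more tuning (minimum length thresholds, class-name hints, text that is not wrapped in <p>), so treat this as a starting point only.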

answered Sep 22 '22 20:09 by lorenzo-s