How do Evernote's Web Clipper plugin and the Announcify plugin extract only the relevant article/post/content part of a page?
No matter which website you visit, and however different its layout is from other sites, these plugins always manage to pick out the article/post/content part of the page.
Each website has a different layout: some have a sidebar, some don't. The tags used for the main/article/content part differ too: some use the HTML5 <article> or <section> elements, others use <h1> > <p>, some use <h2> > <p>, and others use none of these. So there are many different combinations of tags as well as layouts.
Can anyone suggest a way to get the main article/post/content, please, via JavaScript or PHP?
You won't be able to manipulate the URL to get only a portion of the page. So what you'll want to do is grab the page contents via the server-side language of your choice and then parse the HTML.
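As a rough illustration of the fetch-then-parse step, here is a minimal sketch in Python using only the standard library; the same idea carries over to PHP or JavaScript. The names fetch_html, visible_text and VisibleTextParser, and the User-Agent string, are made up for this example and are not taken from any of the tools mentioned above.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Elements whose text is never shown as page content (assumed list).
HIDDEN = {"script", "style", "title"}


def fetch_html(url, timeout=10):
    """Download a page server-side and decode it, falling back to UTF-8."""
    request = Request(url, headers={"User-Agent": "content-extractor-sketch"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside HIDDEN elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.hidden_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the page text with markup, scripts and styles removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Something like visible_text(fetch_html(url)) would then give you the raw text of a page. This only strips markup; it does not yet decide which part of the page is the article.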
There is also “Print Friendly”, a free, easy-to-use Chrome extension and Firefox add-on that lets you print only the article part of a web page without complicating the process.
As an alternative, you could use a simple <div> and the jQuery load function, which loads the whole page and plucks out just the section you want when the URL is followed by a selector, for example $("#result").load("page.html #content") (the ids and file name here are placeholders). There may be other things you need to do, and a significant difference is that the content will become part of the main page instead of being segregated into a separate window.
You can do some simple DOM parsing and search for the <div>s and <p>s containing the most text (text, not HTML code!). However, whichever heuristic you choose for determining where the content is, you should start from DOM parsing, so have a look at the DOM parsing libraries available for PHP.
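To make the "most text" idea concrete, here is a small sketch, written in Python with the standard library rather than PHP. It credits the text of every <p> to its nearest enclosing container and returns the paragraphs of the container that collected the most text. The container list and the names TextDensityParser and extract_main_content are assumptions made for this example; real pages need more care (link-heavy blocks, nested wrappers, content not marked up with <p>).

```python
from html.parser import HTMLParser

# Elements treated as candidate content blocks (an assumed, incomplete list).
CONTAINERS = {"body", "div", "article", "section", "td"}
SKIPPED = {"script", "style"}


class TextDensityParser(HTMLParser):
    """Credits the text of each <p> to its nearest enclosing container."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # every container seen, in document order
        self.open = []     # stack of currently open containers
        self.in_p = False  # True while inside a <p> that has a container
        self.skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            block = {"tag": tag, "attrs": dict(attrs), "paras": []}
            self.blocks.append(block)
            self.open.append(block)
            self.in_p = False
        elif tag == "p" and self.open:
            self.open[-1]["paras"].append("")
            self.in_p = True
        elif tag in SKIPPED:
            self.skip = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag in SKIPPED:
            self.skip = False
        elif tag in CONTAINERS:
            self.in_p = False
            # Close the nearest open container with this tag name, if any.
            for i in range(len(self.open) - 1, -1, -1):
                if self.open[i]["tag"] == tag:
                    del self.open[i:]
                    break

    def handle_data(self, data):
        if self.in_p and not self.skip and self.open:
            self.open[-1]["paras"][-1] += data


def extract_main_content(html):
    """Return the paragraphs of the container holding the most <p> text."""
    parser = TextDensityParser()
    parser.feed(html)
    parser.close()
    best = max(parser.blocks,
               key=lambda b: sum(len(p) for p in b["paras"]),
               default=None)
    if best is None:
        return ""
    paragraphs = (" ".join(p.split()) for p in best["paras"])
    return "\n".join(p for p in paragraphs if p)
```

Paragraphs that sit outside every container are ignored here, and only direct paragraph text counts toward a container's score, so a wrapper <div> around the whole page does not automatically win over the block that actually holds the article.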
Anyway, you can start from this:
http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/
Looks quite good, and gives technical explanations if you want to write something of your own.