I need help tackling a problem. I need a program which, given a site, finds and extracts the "main" picture, i.e. the one which represents the site. (To say it is the biggest or the first picture is sometimes but not always true).
How should I approach this? Are there any libraries that could help me with this? Thanks!
Scraping images from a website works the same way as scraping any other HTML attribute: define a CSS selector for the element, either by clicking on it or by typing its CSS class, element id, or tag name. Then set the extract type to ATTR and the value to "src".
OPTION 1
You could check out Goose. It does something similar to what Pocket and Readability do, i.e. it tries to extract the main article from a given webpage using a set of heuristics. It can apparently also extract the main image from that article, but it is a bit hit-and-miss, so 60% of the time it works every time.
It used to be a Java project but was rewritten in Scala.
From the readme
Goose will try to extract the following information:
- Main text of an article
- Main image of article
- Any Youtube/Vimeo movies embedded in article
- Meta Description
- Meta tags
- Publish Date
Try it here: http://jimplush.com/blog/goose
OPTION 2
You could use a Java wrapper (e.g. GhostDriver) to run a headless browser such as PhantomJS. Then fetch the website and find the img element with the largest dimensions. This GhostDriver test case shows how to query the DOM for elements and get their rendered size.
OPTION 3
Use a library like jsoup to parse the HTML. Collect the src attribute of every img tag, request each image URL you find, and measure the image's dimensions. The one with the largest dimensions is likely to be the website's main image.
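The "measure each image and keep the biggest" step from Option 3 could be sketched roughly as follows, using only the JDK's ImageIO. The class and method names (LargestImageFinder, pixelArea, largest) are made up for illustration, and the sketch assumes you have already collected the image URLs (for example with jsoup) and resolved them to absolute URLs. It downloads and decodes every image in full, which is simple but not cheap.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URL;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of jsoup or any other library.
class LargestImageFinder {

    /** Returns width * height of the image at the URL, or 0 if it cannot be read or decoded. */
    static long pixelArea(URL imageUrl) {
        try {
            // ImageIO.read returns null when no registered reader understands the format.
            BufferedImage image = ImageIO.read(imageUrl);
            return image == null ? 0L : (long) image.getWidth() * image.getHeight();
        } catch (IOException e) {
            // Treat unreachable or broken images as having no area.
            return 0L;
        }
    }

    /** Picks the candidate with the largest pixel area; empty if none could be decoded. */
    static Optional<URL> largest(List<URL> candidates) {
        URL best = null;
        long bestArea = 0L;
        for (URL candidate : candidates) {
            long area = pixelArea(candidate);
            if (area > bestArea) {
                best = candidate;
                bestArea = area;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Calling LargestImageFinder.largest(urls) returns the biggest decodable image, if there is one. As the question notes, the biggest picture is not always the representative one, so treat the result as a guess.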
Another solution is to look for the meta tags used for social media sharing first. If they are present, you are in luck; otherwise you can still try the other solutions.
<meta property="og:image" content="http://www.example.com/image.jpg"/>
<meta name="twitter:image" content="http://www.example.com/image.jpg">
<meta itemprop="image" content="http://www.example.com/image.jpg">
If you are using jsoup, the code would look like this:
// "document" is assumed to be an already parsed org.jsoup.nodes.Document
String imageUrlOpenGraph = document.select("meta[property=og:image]").stream()
        .findFirst()
        .map(element -> element.attr("content").trim())
        .orElse(null);
String imageUrlTwitter = document.select("meta[name=twitter:image]").stream()
        .findFirst()
        .map(element -> element.attr("content").trim())
        .orElse(null);
String imageUrlGooglePlus = document.select("meta[itemprop=image]").stream()
        .findFirst()
        .map(element -> element.attr("content").trim())
        .orElse(null);
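To combine the three lookups into a single result, one possible approach is to take the first value that is neither null nor empty. This is only a sketch: the class name MetaImageFallback and the method firstUsable are invented for this example. The empty check matters because attr("content") yields an empty string when the tag exists but has no content attribute.

```java
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper, not part of jsoup.
class MetaImageFallback {

    /** Returns the first candidate that is non-null and not blank, trimmed. */
    static Optional<String> firstUsable(String... candidates) {
        return Stream.of(candidates)
                .filter(candidate -> candidate != null && !candidate.trim().isEmpty())
                .map(String::trim)
                .findFirst();
    }
}
```

With the variables from the snippet above, MetaImageFallback.firstUsable(imageUrlOpenGraph, imageUrlTwitter, imageUrlGooglePlus) would prefer the Open Graph image, then the Twitter one, then the itemprop one. An empty Optional signals that you need to fall back to one of the other options.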