How to find and extract the "main" image of a website

Tags: java, html

I need help tackling a problem. I need a program which, given a site, finds and extracts the "main" picture, i.e. the one which represents the site. (It is sometimes, but not always, the biggest or the first picture on the page.)

How should I approach this? Are there any libraries that could help me with this? Thanks!

asked Aug 16 '13 07:08

Idan

2 Answers

OPTION 1

You could check out Goose. It does something similar to what Pocket and Readability do, i.e. it tries to extract the main article from a given webpage using a set of heuristics. It can apparently also extract the main image from that article, but it is a bit hit-and-miss, so 60% of the time it works every time.

It used to be a Java project but was rewritten in Scala.

From the readme:

Goose will try to extract the following information:

- Main text of an article
- Main image of article
- Any Youtube/Vimeo movies embedded in article
- Meta Description
- Meta tags
- Publish Date

Try it here: http://jimplush.com/blog/goose

OPTION 2

You could use a Java wrapper (e.g. GhostDriver) for running a headless browser, like PhantomJS. Then fetch the website and find the img element with the largest rendered dimensions. This GhostDriver test case shows how to query the DOM for elements and get their rendered size.

OPTION 3

Use a library like jsoup that helps you parse HTML. Then get the value of the src attribute of every img tag, request each image URL you find, and measure the dimensions of each downloaded image. The one with the largest dimensions is likely to be the website's main image.
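To make the "measure their sizes" step of Option 3 a bit more concrete, here is a minimal sketch that uses only the JDK's javax.imageio. It assumes you have already collected the src URLs (e.g. with jsoup) and downloaded each one into a byte array; that part is left out. The class name LargestImagePicker and both method names are made up for this illustration, they are not part of jsoup or any other library. It reads only the image header, so it does not decode the full picture.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

// Hypothetical helper for illustration; not part of jsoup or the JDK.
final class LargestImagePicker {

    /** Returns width * height taken from the image header, or -1 if the bytes cannot be read as an image. */
    static long pixelArea(byte[] imageBytes) {
        try (ImageInputStream in =
                 new MemoryCacheImageInputStream(new ByteArrayInputStream(imageBytes))) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return -1; // no installed reader recognizes this format
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: we only need the dimensions
                reader.setInput(in, true, true);
                return (long) reader.getWidth(0) * reader.getHeight(0);
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            return -1; // truncated or corrupt data is treated as "not an image"
        }
    }

    /** Returns the URL whose bytes have the largest pixel area, or null if nothing could be read. */
    static String largest(Map<String, byte[]> imagesByUrl) {
        String bestUrl = null;
        long bestArea = 0;
        for (Map.Entry<String, byte[]> entry : imagesByUrl.entrySet()) {
            long area = pixelArea(entry.getValue());
            if (area > bestArea) { // on a tie the earlier entry wins
                bestArea = area;
                bestUrl = entry.getKey();
            }
        }
        return bestUrl;
    }
}
```

Keep in mind that the largest image is only a heuristic, as the question already points out. In practice you would probably also want to skip very wide or very tall images (banners, spacers) before comparing areas.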

answered Oct 17 '22 05:10

mqchen

Another solution would be to look for the social media sharing meta tags first. If they are present, you are in luck; otherwise you can still try the other solutions.

<meta property="og:image" content="http://www.example.com/image.jpg"/>
<meta name="twitter:image" content="http://www.example.com/image.jpg">
<meta itemprop="image" content="http://www.example.com/image.jpg">

If you are using jsoup, the code would look like this:

    // Open Graph image (used by Facebook and many other sites)
    String imageUrlOpenGraph = document.select("meta[property=og:image]").stream()
            .findFirst()
            .map(el -> el.attr("content").trim())
            .orElse(null);

    // Twitter card image
    String imageUrlTwitter = document.select("meta[name=twitter:image]").stream()
            .findFirst()
            .map(el -> el.attr("content").trim())
            .orElse(null);

    // itemprop="image" microdata (the tag Google+ used)
    String imageUrlGooglePlus = document.select("meta[itemprop=image]").stream()
            .findFirst()
            .map(el -> el.attr("content").trim())
            .orElse(null);
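The lookups above can each come back null or empty, and the content value may be a relative URL. One possible way to turn them into a single result is to take the first usable candidate in your preferred order and resolve it against the page URL. This is only a sketch using java.net.URI from the JDK; the class name MetaImageChooser and the method choose are invented for this example, and the priority order is an assumption you may want to change.

```java
import java.net.URI;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration; not part of jsoup.
final class MetaImageChooser {

    /**
     * Returns the first non-blank candidate, resolved against the page URL,
     * or an empty Optional if every candidate is missing or malformed.
     * Throws IllegalArgumentException if pageUrl itself is not a valid URI.
     */
    static Optional<String> choose(String pageUrl, List<String> candidatesInPriorityOrder) {
        URI base = URI.create(pageUrl);
        for (String candidate : candidatesInPriorityOrder) {
            if (candidate == null || candidate.trim().isEmpty()) {
                continue; // tag was absent or had an empty content attribute
            }
            try {
                // Handles absolute URLs, "/path", "//host/path" and plain relative paths
                return Optional.of(base.resolve(candidate.trim()).toString());
            } catch (IllegalArgumentException e) {
                // Not a valid URI (for example it contains a raw space): try the next one
            }
        }
        return Optional.empty();
    }
}
```

With the three variables from the snippet above you would call it as MetaImageChooser.choose(pageUrl, Arrays.asList(imageUrlOpenGraph, imageUrlTwitter, imageUrlGooglePlus)), where pageUrl is the address you fetched the document from. If the result is empty, fall back to one of the other options.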

answered Oct 17 '22 04:10

mmx73
