
How can I scrape text and images from a random web page?

I need a way to visually represent a random web page on the internet.

Let's say for example this web page.

Currently, these are the standard assets I can use:

  • Favicon: Too small, too abstract.
  • Title: Very specific but poor visual aesthetics.
  • URL: Nobody cares to read.
  • Icon: Too abstract.
  • Thumbnail: Hard to get, too ugly (many elements crammed in a small space).

I need to visually represent a random website in a way that is very meaningful and inviting for others to click on it.

I need something like what Facebook does when you share a link:


It scrapes the link for images and then creates a beautiful, meaningful tile which is inviting to click on.


Any way I can scrape the images and text from websites? I'm primarily interested in an Objective-C/JavaScript combo, but anything will do and will be selected as the approved answer.

Edit: Re-wrote the post and changed the title.

asked Mar 17 '18 by Vulkan

2 Answers

Websites will often provide meta information for user-friendly social media sharing, such as Open Graph protocol tags. In fact, in your own example, the Reddit page has Open Graph tags which make up the information in the link preview (look for meta tags with og: properties).
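As a rough illustration, here is a minimal sketch of reading those tags server-side. It assumes Node 18+ (for the global fetch) and uses a naive regex over the meta tags rather than a real HTML parser; the fetchOpenGraph name and the User-Agent string are placeholders, not part of any library.

```typescript
// Minimal sketch: fetch a page server-side and pull out its Open Graph tags.
// Assumes Node 18+ (global fetch). The regex approach is deliberately naive
// and only handles reasonably well-formed <meta> tags.
async function fetchOpenGraph(url: string): Promise<Record<string, string>> {
  const response = await fetch(url, {
    headers: { "User-Agent": "link-preview-bot/1.0" }, // some sites reject blank user agents
  });
  const html = await response.text();

  const og: Record<string, string> = {};
  for (const match of html.matchAll(/<meta\b[^>]*>/gi)) {
    const tag = match[0];
    const prop = tag.match(/property=["']og:([^"']+)["']/i);
    const content = tag.match(/content=["']([^"']*)["']/i);
    if (prop && content) og[prop[1]] = content[1];
  }
  return og; // e.g. { title: "...", image: "https://...", description: "..." }
}

// Usage:
// const preview = await fetchOpenGraph("https://www.reddit.com/r/...");
// console.log(preview.title, preview.image);
```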

A fallback approach would be to implement site-specific parsing code for the most popular websites that don't conform to a standardized format, or to try to generically guess what the most prominent content on a given page is (for example, the biggest image above the fold, the first few sentences of the first paragraph, text in heading elements, etc.).
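A crude sketch of that second, heuristic approach could look like the following; the specific heuristics (first <h1>, first <img>, first <p>) and the guessPreview name are illustrative assumptions, not a proven way of ranking prominence on an arbitrary page.

```typescript
// Rough sketch of the "guess the prominent content" fallback, again using
// naive regexes rather than a real DOM parser.
function guessPreview(html: string): { title?: string; image?: string; text?: string } {
  const strip = (s: string) => s.replace(/<[^>]+>/g, "").trim();
  const h1 = html.match(/<h1[^>]*>([\s\S]*?)<\/h1>/i);   // first heading as a title candidate
  const img = html.match(/<img[^>]+src=["']([^"']+)["']/i); // first image as a thumbnail candidate
  const p = html.match(/<p[^>]*>([\s\S]*?)<\/p>/i);        // first paragraph as a teaser candidate

  return {
    title: h1 ? strip(h1[1]) : undefined,
    image: img ? img[1] : undefined,
    text: p ? strip(p[1]).slice(0, 200) : undefined, // first ~200 characters as teaser text
  };
}
```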

The problem with the former approach is that you have to maintain the parsers as those websites change and evolve; with the latter, you simply cannot reliably predict what's important on a page, and you can't expect to always find what you're looking for either (images for the thumbnail, for example).

Since you will never be able to generate meaningful previews for 100% of websites, it boils down to a simple question: what's an acceptable rate of successful link previews? If it's close to what you can get by parsing standard meta information, I'd stick with that and save myself a lot of headache. If not, as an alternative to the libraries shared above, you can also have a look at paid services/APIs, which will likely cover more use cases than you could on your own.

answered Sep 24 '22 by Unglückspilz


This is what the Open Graph standard is for. For instance, if you go to the Reddit post in the example, you can view the page information provided by HTML <meta /> tags (all the ones with property attributes starting with og:):

[Screenshot: og: meta tags in the Reddit page source]

However, it is not possible for you to get the data from inside a web browser; CORS prevents the cross-origin request to the URL. What Facebook appears to do is send the URL to their own servers, have them perform the request to get the required information, and send the result back.
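A minimal sketch of that server-side proxy pattern, assuming Node's built-in http module and reusing the hypothetical fetchOpenGraph helper sketched in the first answer:

```typescript
// Minimal sketch of a server-side proxy: the browser can't fetch the target
// page directly because of CORS, so it asks this server, which fetches the
// page and returns the extracted metadata as JSON.
import { createServer } from "node:http";

createServer(async (req, res) => {
  const target = new URL(req.url ?? "/", "http://localhost").searchParams.get("url");
  if (!target) {
    res.writeHead(400).end("missing ?url= parameter");
    return;
  }
  try {
    const og = await fetchOpenGraph(target); // server-to-server request, not subject to CORS
    res.writeHead(200, {
      "Content-Type": "application/json",
      "Access-Control-Allow-Origin": "*", // allow your own front end to read the result
    });
    res.end(JSON.stringify(og));
  } catch {
    res.writeHead(502).end("could not fetch the target page");
  }
}).listen(3000);

// Browser side:
// fetch("http://localhost:3000/?url=" + encodeURIComponent(pageUrl)).then(r => r.json())
```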

answered Sep 21 '22 by andrew