
How can I scrape text and images from a random web page?

I need a way to visually represent a random web page on the internet.

Let's say for example this web page.

Currently, these are the standard assets I can use:

  • Favicon: Too small, too abstract.
  • Title: Very specific but poor visual aesthetics.
  • URL: Nobody cares to read.
  • Icon: Too abstract.
  • Thumbnail: Hard to get, too ugly (many elements crammed in a small space).

I need to visually represent a random website in a way that is very meaningful and inviting for others to click on it.

I need something like what Facebook does when you share a link:


It scrapes the link for images and then creates a beautiful, meaningful tile which is inviting to click on.


Any way I can scrape the images and text from websites? I'm primarily interested in an Objective-C/JavaScript combo, but anything will do and will be selected as the approved answer.

Edit: Re-wrote the post and changed the title.

asked Mar 17 '18 by Vulkan

2 Answers

Websites will often provide meta information for user-friendly social media sharing, such as Open Graph protocol tags. In fact, in your own example, the Reddit page has Open Graph tags which make up the information in the link preview (look for meta tags with og: properties).
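As a rough illustration, here is a minimal sketch of reading those tags server-side. It assumes Node 18+ (for the global fetch) and uses a naive regex over the meta tags rather than a real HTML parser; the fetchOpenGraph name and the User-Agent string are placeholders, not part of any library.

```typescript
// Minimal sketch: fetch a page server-side and pull out its Open Graph tags.
// Assumes Node 18+ (global fetch). The regex approach is deliberately naive
// and only handles reasonably well-formed <meta> tags.
async function fetchOpenGraph(url: string): Promise<Record<string, string>> {
  const response = await fetch(url, {
    headers: { "User-Agent": "link-preview-bot/1.0" }, // some sites reject blank user agents
  });
  const html = await response.text();

  const og: Record<string, string> = {};
  for (const match of html.matchAll(/<meta\b[^>]*>/gi)) {
    const tag = match[0];
    const prop = tag.match(/property=["']og:([^"']+)["']/i);
    const content = tag.match(/content=["']([^"']*)["']/i);
    if (prop && content) og[prop[1]] = content[1];
  }
  return og; // e.g. { title: "...", image: "https://...", description: "..." }
}

// Usage:
// const preview = await fetchOpenGraph("https://www.reddit.com/r/...");
// console.log(preview.title, preview.image);
```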

A fallback approach would be to implement site-specific parsing code for the most popular websites that don't conform to a standardized format, or to try to generically guess what the most prominent content on a given page is (for example, the biggest image above the fold, the first few sentences of the first paragraph, text in heading elements, etc.).
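A crude sketch of that second, heuristic approach could look like the following; the specific heuristics (first <h1>, first <img>, first <p>) and the guessPreview name are illustrative assumptions, not a proven way of ranking prominence on an arbitrary page.

```typescript
// Rough sketch of the "guess the prominent content" fallback, again using
// naive regexes rather than a real DOM parser.
function guessPreview(html: string): { title?: string; image?: string; text?: string } {
  const strip = (s: string) => s.replace(/<[^>]+>/g, "").trim();
  const h1 = html.match(/<h1[^>]*>([\s\S]*?)<\/h1>/i);   // first heading as a title candidate
  const img = html.match(/<img[^>]+src=["']([^"']+)["']/i); // first image as a thumbnail candidate
  const p = html.match(/<p[^>]*>([\s\S]*?)<\/p>/i);        // first paragraph as a teaser candidate

  return {
    title: h1 ? strip(h1[1]) : undefined,
    image: img ? img[1] : undefined,
    text: p ? strip(p[1]).slice(0, 200) : undefined, // first ~200 characters as teaser text
  };
}
```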

The problem with the former approach is that you have to maintain the parsers as those websites change and evolve; with the latter, you simply cannot reliably predict what's important on a page, and you can't expect to always find what you're looking for either (images for the thumbnail, for example).

Since you will never be able to generate meaningful previews for 100% of websites, it boils down to a simple question: what's an acceptable rate of successful link previews? If it's close to what you can get by parsing standard meta information, I'd stick with that and save myself a lot of headache. If not, as an alternative to the libraries shared above, you can also have a look at paid services/APIs, which will likely cover more use cases than you could on your own.

answered Sep 24 '22 by Unglückspilz


This is what the Open Graph standard is for. For instance, if you go to the Reddit post in the example, you can view the page information provided by HTML <meta /> tags (all the ones with property attributes starting with og:):

[Screenshot: og: meta tags in the Reddit page source]

However, it is not possible for you to get the data from inside a web browser; CORS prevents the cross-origin request to the URL. What Facebook appears to do is send the URL to their own servers, have them perform the request to get the required information, and send the result back.
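A minimal sketch of that server-side proxy pattern, assuming Node's built-in http module and reusing the hypothetical fetchOpenGraph helper sketched in the first answer:

```typescript
// Minimal sketch of a server-side proxy: the browser can't fetch the target
// page directly because of CORS, so it asks this server, which fetches the
// page and returns the extracted metadata as JSON.
import { createServer } from "node:http";

createServer(async (req, res) => {
  const target = new URL(req.url ?? "/", "http://localhost").searchParams.get("url");
  if (!target) {
    res.writeHead(400).end("missing ?url= parameter");
    return;
  }
  try {
    const og = await fetchOpenGraph(target); // server-to-server request, not subject to CORS
    res.writeHead(200, {
      "Content-Type": "application/json",
      "Access-Control-Allow-Origin": "*", // allow your own front end to read the result
    });
    res.end(JSON.stringify(og));
  } catch {
    res.writeHead(502).end("could not fetch the target page");
  }
}).listen(3000);

// Browser side:
// fetch("http://localhost:3000/?url=" + encodeURIComponent(pageUrl)).then(r => r.json())
```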

answered Sep 21 '22 by andrew