
What is the difference between web-crawling and web-scraping? [duplicate]

Is there a difference between Crawling and Web-scraping?

If there's a difference, what's the best method to use in order to collect some web data to supply a database for later use in a customised search engine?

wassimans asked Dec 01 '10 17:12


People also ask

What is the difference between crawling and scraping?

The web crawling process usually captures generic information, whereas web scraping homes in on specific pieces of data. Web scraping, also known as web data extraction, is similar to web crawling in that it identifies and locates the target data on web pages.

What is difference between data scraping and web scraping?

Web scraping is basically extracting data from websites in an automated manner. It is automated because it uses bots to scrape the information or content from websites. It's a programmatic analysis of a web page to download information from it. Data scraping involves locating data and then extracting it.

What is crawler and scraper?

A crawler (or spider) follows each link in the pages it crawls, beginning from a starter page. This is why it is also called a spider bot: it builds a kind of spider web of pages. A scraper extracts data from a page, usually from the pages downloaded by the crawler.

Is Google a web crawler or web scraper?

Famous search engines such as Google, Yahoo and Bing do web crawling and use this information for indexing web pages.


2 Answers

Crawling is essentially what Google, Yahoo, MSN, etc. do: looking for ANY information. Scraping is generally targeted at certain websites and at specific data, e.g. for price comparison, so the two are coded quite differently.

Usually a scraper is bespoke to the websites it is supposed to scrape, and does things a (good) crawler wouldn't do, e.g.:

  • Have no regard for robots.txt
  • Identify itself as a browser
  • Submit forms with data
  • Execute JavaScript (if required to act like a user)

Ben answered Oct 11 '22 11:10


Yes, they are different. In practice, you may need to use both.

(I have to jump in because, so far, the other answers don't get to the essence of it. They use examples but don't make the distinctions clear. Granted, they are from 2010!)

Web scraping, to use a minimal definition, is the process of parsing a web document and extracting information from it. You can do web scraping without doing web crawling.
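To make the scraping definition concrete, here is a minimal sketch using only Python's standard library. It pulls data out of a single document that is already in hand, with no crawling involved. The `PriceScraper` class, the `scrape_prices` helper, the `price` CSS class, and the sample HTML are all made up for illustration; a real scraper would be written against the markup of the site it targets.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collects the text of every <span class="price"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False


def scrape_prices(html):
    """Extract price strings from one HTML document."""
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices


# Assumed sample input, standing in for a downloaded page.
SAMPLE_HTML = (
    '<ul>'
    '<li>Widget <span class="price">9.99</span></li>'
    '<li>Gadget <span class="price">24.50</span></li>'
    '</ul>'
)

print(scrape_prices(SAMPLE_HTML))
```

The document could just as well come from a file or a single HTTP request; nothing here follows links.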

Web crawling, to use a minimal definition, is the process of iteratively finding and fetching web links starting from a list of seed URLs. Strictly speaking, to do web crawling, you have to do some degree of web scraping (to extract the URLs).
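The crawling loop can be sketched in a few lines as well. This is one possible breadth-first version, not a production crawler: it takes the fetch function as a parameter so that the example runs against a small in-memory "web" (`FAKE_WEB`, invented here) instead of the network. The names `crawl`, `LinkExtractor`, and `max_pages` are also assumptions for illustration. Note that the link extraction step is itself a small piece of scraping.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Scrapes href values from <a> tags: the minimal scraping a crawler needs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns URLs in visit order.

    fetch(url) must return the page's HTML, or None if it cannot be retrieved.
    """
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Assumed stand-in for the network: URL -> HTML.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

print(crawl(["http://example.test/"], FAKE_WEB.get))
```

For the search engine use case in the question, the crawler would hand each fetched page to a scraper, which extracts whatever fields the database needs.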

To clear up some concepts mentioned in the other answers:

  • robots.txt is intended to apply to any automated process that accesses a web page. So it applies to both crawlers and scrapers.

  • 'Proper' crawlers and scrapers, both, should identify themselves accurately.
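Both points can be illustrated with the standard library. The sketch below checks a URL against robots.txt rules with `urllib.robotparser` and attaches an honest `User-Agent` header to a request. The bot name, the info URL, and the robots.txt content are invented for the example; in real use you would load the site's actual robots.txt (for instance with `RobotFileParser.set_url` and `read`) rather than parse a literal string.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity; a real one should point to a page describing the bot.
USER_AGENT = "ExampleSearchBot/0.1 (+http://example.test/bot-info)"

# Assumed robots.txt content, standing in for the file served by the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt):
    """Return a function that says whether our bot may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(USER_AGENT, url)

    return allowed


def build_request(url):
    """Build (but do not send) a request that identifies the bot accurately."""
    return Request(url, headers={"User-Agent": USER_AGENT})


allowed = make_checker(ROBOTS_TXT)
for url in ("http://example.test/public/page", "http://example.test/private/data"):
    print(url, "allowed" if allowed(url) else "disallowed")
```

The same check applies whether the caller is a crawler walking links or a scraper fetching a single known page.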

Some references:

  • Wikipedia on web scraping
  • Wikipedia on web crawlers
  • Wikipedia on robots.txt

David J. answered Oct 11 '22 10:10