What is the difference between web-crawling and web-scraping? [duplicate]

2 Answers

Crawling is essentially what Google, Yahoo, MSN, etc. do: looking for ANY information. Scraping is generally targeted at certain websites and specific data, e.g. for price comparison, so the two are coded quite differently.

Usually a scraper is bespoke to the websites it is supposed to scrape, and does things a (good) crawler wouldn't do, e.g.:

Have no regard for robots.txt
Identify itself as a browser
Submit forms with data
Execute JavaScript (if required to act like a user)


answered Oct 11 '22 11:10

Ben

Yes, they are different. In practice, you may need to use both.

(I have to jump in because, so far, the other answers don't get to the essence of it. They use examples but don't make the distinctions clear. Granted, they are from 2010!)

Web scraping, to use a minimal definition, is the process of parsing a web document and extracting information from it. You can do web scraping without doing web crawling.
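To make that concrete, here is a minimal sketch of scraping with no crawling involved: a single document is parsed and specific values are pulled out. The HTML snippet, the `price` class name, and the `PriceScraper` class are all made up for illustration; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical document; in practice it could come from a file
# or from a single HTTP request to a known page.
SAMPLE_HTML = """
<ul>
  <li>Widget <span class="price">9.99</span></li>
  <li>Gadget <span class="price">24.50</span></li>
</ul>
"""

scraper = PriceScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.prices)  # ['9.99', '24.50']
```

No links are followed here, which is the point: the scraper only cares about the data inside one known document.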

Web crawling, to use a minimal definition, is the process of iteratively finding and fetching web links starting from a list of seed URLs. Strictly speaking, to do web crawling, you have to do some degree of web scraping (to extract the URLs).
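A rough sketch of that loop, under some assumptions: the `crawl` function, the `LinkExtractor` class, and the in-memory `FAKE_WEB` dictionary (standing in for real HTTP fetches) are hypothetical names invented for this example. Note how the only "scraping" the crawler does is extracting `href` values.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """The minimal scraping a crawler needs: collect href values."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable or missing page
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical three-page site used instead of real network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c#top">C</a>',
}

order = crawl(["http://example.test/"], FAKE_WEB.get)
print(order)
```

Swapping `FAKE_WEB.get` for a function that performs real HTTP requests would turn this into a (very naive) live crawler; a real one would also need politeness delays, error handling, and robots.txt checks.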

To clear up some concepts mentioned in the other answers:

robots.txt is intended to apply to any automated process that accesses a web page. So it applies to both crawlers and scrapers.
'Proper' crawlers and scrapers should both identify themselves accurately.
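As an illustration of both points, the sketch below checks robots.txt rules before building a request and sends an honest User-Agent header. The bot name, the info URL, and the robots.txt contents are invented for the example; `RobotFileParser` and `Request` are from Python's standard library. The rules are parsed from a string here so that nothing touches the network.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical identity for this bot; a real one should point to
# a page describing who runs it and how to get in touch.
BOT_NAME = "ExampleBot"
USER_AGENT = BOT_NAME + "/0.1 (+http://example.test/bot-info)"

# Hypothetical robots.txt; normally fetched from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def build_request(url):
    """Return a Request with an honest User-Agent, or None if disallowed."""
    if not rules.can_fetch(BOT_NAME, url):
        return None
    return Request(url, headers={"User-Agent": USER_AGENT})


print(build_request("http://example.test/private/data"))        # None
print(build_request("http://example.test/index.html").full_url)
```

The same check applies whether the program is a crawler following links or a scraper hitting one known page.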

Some references:

Wikipedia on web scraping
Wikipedia on web crawlers
Wikipedia on robots.txt

answered Oct 11 '22 10:10

David J.


Tags:

search-engine

web-scraping

web-crawler

wassimans
