crawler vs scraper

4 Answers

A crawler gets web pages -- i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore) it downloads whatever is linked to from the starting point(s).
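To make the description above concrete, here is a minimal sketch of a breadth-first crawler using only the Python standard library. It is an illustration under stated assumptions, not a production crawler: the names `LinkParser`, `fetch`, and `crawl`, the default depth limit, and the list of skipped file suffixes are all made up for this example. It also ignores robots.txt and rate limiting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_urls, max_depth=2, skip_suffixes=(".jpg", ".png", ".pdf"), fetch=fetch):
    """Breadth-first crawl; returns {url: page_text} for every page fetched.

    max_depth and skip_suffixes are the "conditions" from the answer
    (hypothetical defaults chosen for illustration).
    """
    pages = {}
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        parser = LinkParser()
        parser.feed(pages[url])
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if link in seen or link.lower().endswith(skip_suffixes):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return pages
```

The `fetch` parameter is injectable so the crawl logic can be tried against an in-memory dictionary of pages instead of the live web.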

A scraper takes pages that have been downloaded or, in a more general sense, data that's formatted for display, and (attempts to) extract data from those pages, so that it can (for example) be stored in a database and manipulated as desired.
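As a rough sketch of the scraping side, the following takes an already-downloaded page, pulls the cells out of an HTML table, and stores them in a database. Everything specific here is assumed for illustration: the sample `PAGE`, the two-column name/price layout, the `products` table, and the helper names `TableScraper`, `scrape_prices`, and `store`. Rows that do not fit the expected shape are skipped, which is the "(attempts to)" part.

```python
import sqlite3
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells end up empty
                self.rows.append(tuple(self._row))
            self._row = None


def scrape_prices(html):
    """Return (name, price) pairs, skipping rows that do not parse."""
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    records = []
    for row in scraper.rows:
        if len(row) != 2:
            continue
        name, price = row
        try:
            records.append((name, float(price)))
        except ValueError:
            continue  # price cell was not a number
    return records


def store(records, connection):
    """Save scraped records so they can be queried later."""
    connection.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    connection.executemany("INSERT INTO products VALUES (?, ?)", records)
    connection.commit()


# Hypothetical page content standing in for a downloaded page.
PAGE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>n/a</td></tr>
  <tr><td>Gizmo</td><td> 24.50 </td></tr>
</table>
"""

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    store(scrape_prices(PAGE), db)
    print(db.execute("SELECT name, price FROM products ORDER BY price").fetchall())
```

Note that nothing in this sketch touches the network: the scraper only cares about the markup it is handed, which is why it can work on pages fetched by a separate crawler.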

Depending on how you use the result, scraping may well violate the rights of the owner of the information and/or user agreements about use of web sites (crawling violates the latter in some cases as well). Many sites include a file named robots.txt in their root (i.e. having the URL http://server/robots.txt) to specify how (and if) crawlers should treat that site -- in particular, it can list (partial) URLs that a crawler should not attempt to visit. These can be specified separately per crawler (user-agent) if desired.
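For a sense of how a crawler can honour robots.txt, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt content and the user-agent names `MyCrawler` and `BadBot` are invented for the example. The rules are parsed from a string so the snippet runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and one specific crawler (user-agent) is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def load_rules(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    for agent, url in [
        ("MyCrawler", "http://server/index.html"),
        ("MyCrawler", "http://server/private/data.html"),
        ("BadBot", "http://server/index.html"),
    ]:
        print(agent, url, rules.can_fetch(agent, url))
```

Against a live site you would instead call `set_url("http://server/robots.txt")` followed by `read()` on the parser, then check `can_fetch` before requesting each URL.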

answered Oct 06 '22 05:10

Jerry Coffin

Crawlers surf the web, following links. An example would be the Google robot that gets pages to index. Scrapers extract values from forms, but don't necessarily have anything to do with the web.

answered Oct 06 '22 05:10

Steven Sudit

A web crawler collects links (URLs, i.e. pages) according to some logic, and a scraper extracts values from HTML.

There are many web crawler tools. Any XML or HTML parser can be used to extract (scrape) data from crawled pages. (I recommend Jsoup for parsing and extracting data.)

answered Oct 06 '22 07:10

cuneytykaya

Generally, crawlers follow links to reach numerous pages, while scrapers, in some sense, just pull the content displayed on a page and do not follow links any deeper.

The most typical crawler is Googlebot, which follows links to reach all the pages on your website and indexes the content it finds useful (that's why you need robots.txt to say which content you do not want indexed). That is how such content becomes searchable on Google. The purpose of a scraper, by contrast, is just to pull content for personal use, which has little effect on others.

However, there is no longer a sharp distinction between crawlers and scrapers, since some automated web scraping tools, such as Octoparse and import.io, also let you crawl a website by following links. They are not crawlers like Googlebot, but they can automatically crawl websites to gather large amounts of data without coding.

answered Oct 06 '22 07:10

M John


Tags:

terminology

web-crawler

scraper

Nayn
