I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?


Is it ok to scrape data from Google results? [closed]

2 Answers

Google disallows automated access in their TOS, so if you accept their terms you would break them.

That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google and powered its search engine Bing with the results. They got caught red-handed in 2011 :)

There are three options to scrape Google results:

1) Use their API

UPDATE 2020: Google has deprecated previous APIs (again) and has new prices and new limits. Now (https://developers.google.com/custom-search/v1/overview) you can query up to 10k results per day at 1,500 USD per month. More than that is not permitted, and the results are not what they display in normal searches.

You can issue around 40 requests per hour. You are limited to what they give you, so it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.
If you want a higher number of API requests, you need to pay.
60 requests per hour cost 2,000 USD per year; more queries require a custom deal.
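
As a rough illustration of option 1, the sketch below builds a request for the Custom Search JSON API linked in the update and pulls the title and link out of each result. The endpoint and the `key`, `cx`, `q` and `start` parameters are the ones Google documents for that API; `API_KEY` and `ENGINE_ID` are placeholders, and the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
API_KEY = "YOUR_API_KEY"      # placeholder: your own API key
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder: your search engine ID (cx)


def build_search_url(query, start=1):
    """Build the request URL for one page of results."""
    params = {"key": API_KEY, "cx": ENGINE_ID, "q": query, "start": start}
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def extract_results(response):
    """Return (title, link) pairs from a decoded JSON response."""
    return [(item.get("title"), item.get("link"))
            for item in response.get("items", [])]


def search(query):
    """Run one query against the API (needs a valid key and engine ID)."""
    with urllib.request.urlopen(build_search_url(query), timeout=10) as resp:
        return extract_results(json.load(resp))
```

For duplicate-content checks you would probably pass an exact sentence from your page in double quotes, for example `search('"an exact sentence from my page"')`, and look for links that are not your own.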

2) Scrape the normal result pages

Here comes the tricky part. It is possible to scrape the normal result pages, but Google does not allow it.
If you scrape at a rate higher than 8 (updated from 15) keyword requests per hour you risk detection; in my experience, more than 10/h (updated from 20) will get you blocked.
By using multiple IPs you can raise the rate, so with 100 IP addresses you can scrape up to 1,000 requests per hour (24k a day). (updated)
There is an open source search engine scraper written in PHP at http://scraping.compunect.com. It lets you scrape Google reliably, parses the results properly and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart; otherwise the code will still be useful to learn how it is done.
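
To make the rate arithmetic above concrete, here is a minimal sketch that computes the totals and hands out IP addresses so that no single address exceeds its hourly budget. The per-IP limits are the numbers from this answer, not anything Google publishes; the `IpRotation` class and its method names are hypothetical, and it only does the bookkeeping (it sends no requests and does not sleep).

```python
def max_requests(ip_count, per_ip_per_hour=10):
    """Return (requests per hour, requests per day) across all IPs."""
    per_hour = ip_count * per_ip_per_hour
    return per_hour, per_hour * 24


class IpRotation:
    """Pick the IP that becomes free first and say how long to wait."""

    def __init__(self, ips, per_ip_per_hour=8):
        # Minimum spacing between two requests from the same IP, in seconds.
        self.delay = 3600 / per_ip_per_hour
        # Earliest time (in seconds) at which each IP may be used again.
        self.next_free = {ip: 0.0 for ip in ips}

    def acquire(self, now):
        """Return (ip, wait_seconds) for the next request at time `now`."""
        ip = min(self.next_free, key=self.next_free.get)
        wait = max(0.0, self.next_free[ip] - now)
        self.next_free[ip] = max(now, self.next_free[ip]) + self.delay
        return ip, wait


print(max_requests(100))  # (1000, 24000)
```

At 8 requests per hour the spacing per IP is 3600 / 8 = 450 seconds, so with two addresses the third call to `acquire` at the same instant reports a 450 second wait.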

3) Alternatively use a scraping service (updated)

Recently a customer of mine had a huge search engine scraping requirement, but it was not 'ongoing'; it was more like one huge refresh per month.
In this case I could not find a self-made solution that was 'economic'.
I used the service at http://scraping.services instead. They also provide open source code, and so far it's running well (several thousand result pages per hour during the refreshes).
The downside is that such a service means your solution is "bound" to one professional supplier; the upside is that it was a lot cheaper than the other options I evaluated (and faster in our case).
One option to reduce the dependency on one company is to use two approaches at the same time: the scraping service as the primary source of data, falling back to a proxy-based solution as described in 2) when required.
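
The primary/fallback idea from the last point can be sketched as a small wrapper. Both fetchers are hypothetical callables you would supply yourself (one wrapping the scraping service, one wrapping your own proxy-based scraper); nothing here reflects the actual client of any particular service.

```python
def fetch_with_fallback(query, primary, fallback):
    """Try the primary source first; use the fallback if it fails.

    `primary` and `fallback` are callables taking a query string and
    returning a list of results. Returns (results, source_name).
    """
    try:
        return primary(query), "primary"
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        print(f"primary source failed ({exc}), using fallback")
        return fallback(query), "fallback"
```

Returning the source name alongside the results makes it easy to log how often the fallback is actually used, which tells you whether the second approach is worth maintaining.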

answered Oct 13 '22 11:10

John

Google will eventually block your IP when you exceed a certain number of requests.

answered Oct 13 '22 11:10

Severin
