I'm trying to scrape the title of the product available on this webpage using the requests module, but the script always throws AttributeError, even though the product title is in the page source (Ctrl + U).
I've tried with the following (throws AttributeError):
import requests
from bs4 import BeautifulSoup
link = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "lxml")
try:
    product_title = soup.select_one("h1 > span").get_text(strip=True)
except AttributeError:
    product_title = ""
print(product_title)
Expected output:
Gigabyte GeForce RTX 3070 Aorus Master 8GB OC GPU
How can I scrape the product title from that webpage?
PS I've tried with this library cloudscraper as well, but no luck.
EDIT:
I get the following error:
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url
when I run the following piece of code:
import cfscrape
url = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}
token, agent = cfscrape.get_tokens(url, headers=headers)
print(token, agent)
I know I could have used the value of cf_clearance within the cookies to access the page content, if I could get the token value from the attempt above.
This is only a placeholder for research that might be useful to others looking at this Cloudflare bypass issue.
The task is scraping information from a website that uses either a Cloudflare CAPTCHA or a JavaScript challenge for enhanced protection.
With a standard Python requests.get call, the Cloudflare service returns a 403 Forbidden error code.
import requests
URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
'-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}
response = requests.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
Status Code: 403
Status Code Reason: Forbidden
If we look at the response.headers we can see that a Cloudflare server is proxying our request to the target URL.
# ...continued from the code above
for key, value in response.headers.items():
    print(f'KEY NAME: {key}')
    print(f'KEY VALUE: {value}')
    print('-----------------------')
# output
KEY NAME: Date
KEY VALUE: Sun, 13 Jun 2021 16:39:03 GMT
-----------------------
KEY NAME: Content-Type
KEY VALUE: text/html; charset=UTF-8
-----------------------
KEY NAME: Transfer-Encoding
KEY VALUE: chunked
-----------------------
KEY NAME: Connection
KEY VALUE: close
-----------------------
KEY NAME: Permissions-Policy
KEY VALUE: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
-----------------------
KEY NAME: Cache-Control
KEY VALUE: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
-----------------------
KEY NAME: Expires
KEY VALUE: Thu, 01 Jan 1970 00:00:01 GMT
-----------------------
KEY NAME: X-Frame-Options
KEY VALUE: SAMEORIGIN
-----------------------
KEY NAME: cf-request-id
KEY VALUE: 0aa7d6c7c4000007ff7201b000000001
-----------------------
KEY NAME: Expect-CT
KEY VALUE: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
-----------------------
KEY NAME: Set-Cookie
KEY VALUE: __cf_bm=72427e2af66c7177feeb88a847fae9c26b66c681-1623602343-1800-AZAmqDfaHZU8IXOH/i3BBVf8pGcws0Gc1Tln5yKUepe3utWlCpagxvALDW6wiHd2pli9Zl45Mg8gC/QSoUFhoes=; path=/; expires=Sun, 13-Jun-21 17:09:03 GMT; domain=.cclonline.com; HttpOnly; Secure; SameSite=None
-----------------------
KEY NAME: Vary
KEY VALUE: Accept-Encoding
-----------------------
KEY NAME: Server
KEY VALUE: cloudflare
-----------------------
KEY NAME: CF-RAY
KEY VALUE: 65ecc0b9383b07ff-ATL
-----------------------
KEY NAME: Content-Encoding
KEY VALUE: gzip
-----------------------
If we look at the response.text returned for the same request, we can see further evidence of the Cloudflare protection.
# ...continued from the code above
print(response.text)
# output
truncated...
<title>Please Wait... | Cloudflare</title>
<meta name="captcha-bypass" id="captcha-bypass" />
truncated...
<form class="challenge-form managed-form" id="challenge-form" action="/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/?__cf_chl_managed_tk__=7d4597196bb14948881846ca16631b64c55f06d3-1623602854-0-AcX2yHJM2sCalL03Opq9RiFjASeYE0Xs0KG4XeG1lezzhzEyu-bL8xsdHuEjNIIKaJkWEmha4DhViRlqWEP_HREOdA8YAY7nnNkBAHbNMs6p_AWgYNLPnSNM13PO2I96hdABtoaaKjOzV4AyJQJ8f08XEW2flN97rPxIMeiR0tI1a3PiON2dN9E_YCyneAuCUfaYWUNGL0Bqd_rkYp3Ljb2zk_kGWizckr1fvhodSEjEB-ByYVK8ODNox2oZ4XPcmCYJ6UNDmbNc406BjMeTf3e72Z7vgdnt3V714VrGN4w_Y4VQ2X1V0OVKUKEH9B5Rxa_4fEZiMAAdxZ6idg69JYMKftuuLemr53n5WAwTwyX2G7N9jmjtarxEQcCqoj9oY7oSFwQTb3ZVb9i5EeavKaE1_67wxpyPybNidBDxhLazDEMefPZGDsV9mSziuIQ90nS5vn-7sUvC8BJATNWPbh6OduchXy-QcMeYhurtukUCm3oDQMP7r4g4qvDCWI3_-ku7u-B4G2XI2kwM_tLVEZiH5uHPjWpHE6eFWohiCTxd4p7vHg7z5ug9feRalYqu3GfInd82GZ-j-7nCqLDmPh2Sjlu6sJGfopqM3XlBrd1kgRZU3Z4uw6JIIqfH0M6K3_weTtem0-Z1zhDUBbVDvgJVeHNNh_bTxHGWbFB0f80tALBMbt67RftO5u1XBUZ-TRftteXBwJ8gmYzOZTo4lQOGQ_771urYXsTuW_sp8PwxvQpEyCnY8zD8dmVz0-waZhOet8MQMwduN2nfGUOrCMwUYO9McsBqzfsT5PJZVkDm-rYBBwqw0PIwvm1-N8ymAjrpSN6ps4FerqK1uQOo77FLiOq8JCOVqdETIZ9NO07A" method="POST" enctype="application/x-www-form-urlencoded">
truncated...
<input type="hidden" name="r" value="d5db3eb87c9b42ec7f076916611c296abfd2c842-1623602854-0-AXz7+uyFGbpY1aOLgfZMm0oIiiepEo5I5QmdTnvMmL9fDUc4OMEa2CNYXsbHVjOzdYO+PqegjpNL8R3D9LhDc+Xo0y0ira1zO7foozPj0qdcUpNNr2ZOHqgUyKws6dVgeBNUdF+v9+eNFxSHxOhc4DWDLIw9guBqJg1GaBjG3QCQdZmyFbPxXUQtXTFmtVVuqch9qBFLa/u9deMBCxCWi5fyKoOINtyBtyT4p79ITb9T+6T7fl2epMXNHO6xBW2dPnDP1FmjUQ04CG3ydOaDS5qoSFMPr4InVbMcI2NbQYJYPfWjmncMaga6K+NMNvv8wtiyXpEeWsUgFFeQoDJEuvLI+wkI8mT+vXAnXd8LWy9TpEDVK6uxtLF2C75aU7qJxI9RKANGluWYUXeqE1tXgppgZraIGfRWNPVsQZzqd6SK+Zsg8x8UH7oRRD9blMMPMaekcFQ3zT8QQ5BzEc8wEQ68OhmKbFuAeV/YhhWshpm808gcVHIFH17I+0MEidfV/ny5wBSRZJyQUfOSU9iAv/minNWF6ZA21E/+Zebda2lVF6gyEHgrjecxuOxzY2I2qMm0RCEHO4oSk/X8EtMYirGCQ3FD8PzSvZYx+34QZutXFLVvqT3CR/UcsXybG6wllvIGvZ6j/gdoAwfcS27MyO4mXDMk6TfDqdi+NqlItwgWNdp461RQmPdChRp9kKEy3sTsIAGW9Ky1k/xYYcTvLDpCGFICBEm2JhDyp/FEF9UBYia7XJ4aUEncSUeViqaQ8bXpPk6kEPH5RYEcfaX3he0W5aZHHIGcjgOFZsuu45MWREvbHjO+RcPMib4L+lU1cKQoYx+w5b9e4AJiRnGog3a6E3i/L75bSnk7L3qA+DofeeccI/RPitqDb/lX31fkhwHfdRWoLt+OILsUfHNni/olGABEUDruwDVpR32xlieS7vekdmQL3oOu5BkAOXoObbb+2nzo6Dvgw7M7rb4muC7US4yCTK0BeGSfu2XvFta228IoGIGa8BjUcb09K6nRdWUwrCXLYS+vIJTegKMeyxlMKNXw7vIaPh9vht4zblhN0bqkN/m/opyXEtzLfhsLuEkHdQ0GhTUk2nYgHeKX0j6eW0uQhAD/9TLf6UgILCk0+nQvXfEffQCCe/hEfBfkAgiPhr1E3uyPB4vp6Fpy2nnkkzmGv/3P5wg6afKDmU2Ic32u3U47hOlghnc7NlbzFb5R8Tx6vWrkXMDYHdOaaudLtPp5N9y1ceXXaMNAFMVmoqaiHWuV4KN+2rLolSOGUEFNEoRN6Jw9mlq/zniK23gQ2lSy+wIHPRGvRCxhRr5DeskvLgyviAk7IhLH3zMpqxd7i05BIPV3sB8orBzVE4Rqmam3evpTVEMMFRDt/Ol6XUJi66QrLgJyusuv5xL4pKPWZrw/hn3a5j0zrrChUbvM3S94BeWiJS48hA35S9mXLfaKMAZTYZTMqhbW77qwUuquwW2lPEAgSPY7WvvnNRUPXsS1KCPpiuE0TuDFaZQi9UTqlzkQIq84wqVRjQZ0Y0m3PQeI2BbJZ8woKIKiABWbSOuV/kyy5H4L+RVL7Jmc2ndl3HaQ4XlnwDmTuK/gMbRvZe1taVHOyYsXmfEY4XkiaDUneGjBEGnWyiv49DtiG2TLmmIpP1UITmO677eDSoNLHpxp1guMjwL5m3XHKOFNtpLzuiVH4UJdgTjtnmbGHmKGtyy0k3GPZrwyVkZRyS+FZZ5WhTs05rhS+1sg3oDCyTbWeYX9T4VVswRjxq1HsyH8NdZTN4f9BTn9VU0+9JnVAkgLM4JCkV6wqwQf+QMK/MaYWvBwSjYgFUxdEdT7Rls85/M+4GxcaGsiNmsA5Q==">
<input type="hidden" name="cf_captcha_kind" value="h">
<input type="hidden" name="vc" value="4845a44c225a1fa6a61708e11b613971">
truncated...
<script type="text/javascript">
//<![CDATA[
(function(){
var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
var trkjs = isIE ? new Image() : document.createElement('img');
trkjs.setAttribute("src", "/cdn-cgi/images/trace/managed/js/transparent.gif?ray=65eccd326d61f331");
trkjs.id = "trk_managed_js";
trkjs.setAttribute("alt", "");
document.body.appendChild(trkjs);
var cpo=document.createElement('script');
cpo.type='text/javascript';
cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";
document.getElementsByTagName('head')[0].appendChild(cpo);
}());
//]]>
</script>
The information above shows that the request sent with Python requests to the target URL was intercepted by a Cloudflare server, which is challenging it. This challenge has to be passed before the original request is allowed to continue.
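The indicators described above can also be collected in one place. The following is a minimal sketch and not part of the original investigation: the function name summarize_cloudflare_response is hypothetical, and the sample values are abbreviated copies of the output shown above. It only inspects a status code, a header mapping and body text, so it makes no network request.

```python
import re


def summarize_cloudflare_response(status_code, headers, body):
    """Collect the Cloudflare indicators discussed above into a dict.

    Hypothetical helper: `headers` is any mapping of header names to values
    and `body` is the decoded response text.
    """
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    title_match = re.search(r"<title>(.*?)</title>", body, re.IGNORECASE | re.DOTALL)
    return {
        "forbidden": status_code == 403,
        "served_by_cloudflare": lowered.get("server", "").lower().startswith("cloudflare"),
        "ray_id": lowered.get("cf-ray"),
        "page_title": title_match.group(1).strip() if title_match else None,
        "has_challenge_form": 'id="challenge-form"' in body,
    }


# Sample values copied from the output shown above (form tag abbreviated).
sample_headers = {"Server": "cloudflare", "CF-RAY": "65ecc0b9383b07ff-ATL"}
sample_body = (
    "<title>Please Wait... | Cloudflare</title>"
    '<form class="challenge-form managed-form" id="challenge-form" method="POST">'
)

summary = summarize_cloudflare_response(403, sample_headers, sample_body)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With a live response, the call would presumably be summarize_cloudflare_response(response.status_code, response.headers, response.text).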
The OP stated that they attempted to use the cfscrape Python Package to obtain token information from the Cloudflare server.
A standard cfscrape request produces the same response as Python requests.
import cfscrape
URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
'-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}
scraper = cfscrape.create_scraper(delay=10)
response = scraper.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
Status Code: 403
Status Code Reason: Forbidden
The cfscrape package also supports the functions get_tokens and get_cookie_string, but both of these produce the 403 Forbidden error code.
From cfscrape source code:
def is_cloudflare_captcha_challenge(resp):
    return (
        resp.status_code == 403
        and resp.headers.get("Server", "").startswith("cloudflare")
        and b"/cdn-cgi/l/chk_captcha" in resp.content
    )
# the function above is called from this
def request(self, method, url, *args, **kwargs):
    resp = super(CloudflareScraper, self).request(method, url, *args, **kwargs)
    # Check if Cloudflare captcha challenge is presented
    if self.is_cloudflare_captcha_challenge(resp):
        self.handle_captcha_challenge(resp, url)
    # Check if Cloudflare anti-bot "I'm Under Attack Mode" is enabled
    if self.is_cloudflare_iuam_challenge(resp):
        resp = self.solve_cf_challenge(resp, **kwargs)
    return resp
The handle_captcha_challenge function is what tries to solve the Cloudflare JavaScript challenge. This section of the code is what is failing. It's unclear which part of that section is failing, so additional research and testing are required.
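One way to narrow this down is to evaluate each condition of is_cloudflare_captcha_challenge separately against a captured response, since handle_captcha_challenge is only reached when all three hold. The following is a minimal sketch under stated assumptions: CapturedResponse and explain_captcha_check are hypothetical names, and the sample body contains only a line from the truncated excerpt shown earlier, so the marker result is illustrative and not a finding about the full response.

```python
from collections import namedtuple

# Hypothetical stand-in for a requests.Response; only the three attributes
# read by is_cloudflare_captcha_challenge are modelled.
CapturedResponse = namedtuple("CapturedResponse", ["status_code", "headers", "content"])


def explain_captcha_check(resp):
    """Evaluate each condition of the cfscrape check shown above on its own."""
    return {
        "status_is_403": resp.status_code == 403,
        "server_is_cloudflare": resp.headers.get("Server", "").startswith("cloudflare"),
        "has_chk_captcha_marker": b"/cdn-cgi/l/chk_captcha" in resp.content,
    }


# Body built only from the truncated excerpt shown earlier; a full capture
# may contain markers that this excerpt does not.
captured = CapturedResponse(
    status_code=403,
    headers={"Server": "cloudflare"},
    content=b'cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";',
)

checks = explain_captcha_check(captured)
for name, passed in checks.items():
    print(f"{name}: {passed}")
print(f"captcha branch taken: {all(checks.values())}")
```

Running the same checks against the complete response.content of a real request would show which of the three conditions, if any, keeps the captcha branch from being taken.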
PLEASE NOTE: According to the package's developer the module is no longer supported.
The OP also stated that they attempted to use the cloudscraper Python package to obtain token information from the Cloudflare server. It is worth noting that cloudscraper was forked from cfscrape, so the syntax is similar.
cloudscraper gets the same 403 Forbidden error code as cfscrape.
import cloudscraper
URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
'-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}
scraper = cloudscraper.create_scraper()
response = scraper.get(URL)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
Status Code: 403
Status Code Reason: Forbidden
The cloudscraper package also supports the functions get_tokens and get_cookie_string, but both of these produce the 403 Forbidden error code.
The OP also stated that they attempted to use the selenium Python package.
SPECIAL NOTE: During my testing I used selenium with webdrivers for Google Chrome, Mozilla Firefox and Microsoft Edge.
Until some point in the last 12 months, the following options could be used in selenium to bypass Cloudflare protection. Unfortunately, they no longer work today:
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
# additional disable-blink-features are available in Chromium source code on Github
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
Below is a selenium code example using the Chrome webdriver with the switches above.
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)
URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
driver.get(URL)
The code above opens a browser session, which is confronted with a Cloudflare JavaScript challenge. During testing with the switches mentioned above, this challenge never completed. The Cloudflare Ray ID, which is unique per request, rotated many times before I manually terminated the session.
The seleniumwire package is required to obtain the status codes.
Below is a headless mode Chrome webdriver session, which also shows the 403 Forbidden error code for the target URL. The session also shows that hcaptcha.com anti-bot technology is now in the mix.
from seleniumwire import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--headless")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)
URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
driver.get(URL)
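for request in driver.requests:
    # request.response is None when no response was captured
    if request.response:
        print(f'Status Code: {request.response.status_code}')
        print(f'Host Name: {request.host}')
        print('-----------------------')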
# output
Status Code: 403
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 302
Host Name: hcaptcha.com
-----------------------
Status Code: 200
Host Name: newassets.hcaptcha.com
-----------------------
driver.quit()
A standard Chrome webdriver session using the UI shows an iFrame with an "I am human" checkbox.
If I click the checkbox manually or through the selenium session, I'm prompted with a picture captcha, which increases the complexity of bypassing the Cloudflare protection.
When a Cloudflare CAPTCHA or JavaScript challenge is solved, a cf_clearance cookie is set in the client browser. The cf_clearance cookie has a default lifetime of 30 minutes, but this is configurable by the Cloudflare customer.
If you open the OP's target URL manually in a Google Chrome browser, you can see the cf_clearance cookie using Developer Tools.
It seems that the cf_clearance cookie lifetime is set to 60 minutes, based on the UTC time this session started and the expiration date set for the cookie.
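The same comparison of start time and expiration date can be done in code. The sketch below is illustrative only: cookie_lifetime_seconds is a hypothetical helper, and because the captured response earlier contains no cf_clearance cookie, it uses the Date and Set-Cookie (__cf_bm) headers from that response as a stand-in, with the cookie value shortened.

```python
from http.cookiejar import http2time
from http.cookies import SimpleCookie


def cookie_lifetime_seconds(set_cookie_header, date_header, cookie_name):
    """Return the seconds between the response Date and the cookie's expiry."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    expires_text = jar[cookie_name]["expires"]
    # http2time accepts both date styles seen in these headers and returns
    # seconds since the epoch (UTC), or None when the text cannot be parsed.
    expires_at = http2time(expires_text)
    issued_at = http2time(date_header)
    if expires_at is None or issued_at is None:
        raise ValueError("could not parse one of the dates")
    return expires_at - issued_at


# Header values copied from the response shown earlier (cookie value shortened).
set_cookie = (
    "__cf_bm=72427e2af66c7177feeb88a847fae9c26b66c681-1623602343-1800-shortened; "
    "path=/; expires=Sun, 13-Jun-21 17:09:03 GMT; domain=.cclonline.com; "
    "HttpOnly; Secure; SameSite=None"
)
date = "Sun, 13 Jun 2021 16:39:03 GMT"

lifetime = cookie_lifetime_seconds(set_cookie, date, "__cf_bm")
print(f"Lifetime: {lifetime / 60:.0f} minutes")
```

For the __cf_bm headers above, the two timestamps are 30 minutes apart. Applying the same function to a cf_clearance Set-Cookie header, if one can be captured, would confirm or refute the 60 minute estimate.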
So far I haven't found a way to extract this cookie using Python.