The strange thing is that the BeautifulSoup object sometimes gives the desired data, but other times I get an error such as "list index out of range" or "'NoneType' object has no attribute 'findNext'". The data I am after is nested inside other elements.
This is the code:
import requests
from bs4 import BeautifulSoup

url = 'http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
# Exact-match lookup of the label, then the <dd> that follows it
a = soup.find(text=('Socket')).find_next('dd').string
print(a)
The actual problem is that the cell text is not always exactly 'Socket'; sometimes it is surrounded by tabs or spaces. Instead of checking for an exact text match, pass a compiled regular expression pattern:
import re
soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True)
This always prints 1150.
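To make the difference concrete, here is a minimal sketch on made-up markup (the HTML string below is an assumption for illustration, not the store's real page). It uses `string=`, the newer name for the `text=` argument used above, and the built-in `html.parser`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the label text is padded with tabs, spaces and a newline
html = "<dl><dt>\n\t Socket \t</dt><dd> 1150 </dd></dl>"
soup = BeautifulSoup(html, "html.parser")

# Exact match compares against the whole text node, padding included
exact = soup.find(string="Socket")

# A compiled pattern is applied with search(), so the padding does not matter
fuzzy = soup.find(string=re.compile("Socket"))

value = fuzzy.find_next("dd").get_text(strip=True)
print(exact)  # None
print(value)  # 1150
```

The exact lookup returns None, and calling find_next() on that None is what produces the 'NoneType' error from the question.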
To explain the "sometimes" (thanks to @carpetsmoker for the initial suggestion in the comments): if you open the page, clear the cookies and refresh, you may see two different layouts of the same page, with the blocks arranged differently. The same URL therefore serves two different looks and two different HTML sources. What you are seeing is A/B testing:
In marketing and business intelligence, A/B testing is jargon for a randomized experiment with two variants, A and B, which are the control and treatment in the controlled experiment. It is a form of statistical hypothesis testing with two variants leading to the technical term, Two-sample hypothesis testing, used in the field of statistics.
In other words, they are experimenting with the product page and gathering statistics such as click rate and number of sales.
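Since either variant may come back, it can help to make the lookup tolerant of a missing label instead of letting it raise. The following is only a sketch: `spec_value` is a hypothetical helper, and both HTML snippets are invented stand-ins for the two page variants.

```python
import re
from bs4 import BeautifulSoup

def spec_value(soup, label):
    """Return the text of the <dd> after `label`, or None if it is not found."""
    # re.escape treats the label as literal text rather than a pattern
    node = soup.find(string=re.compile(re.escape(label)))
    if node is None:
        return None
    dd = node.find_next("dd")
    return dd.get_text(strip=True) if dd is not None else None

# Two made-up variants of the "same" page
variant_a = BeautifulSoup("<dl><dt> Socket </dt><dd>1150</dd></dl>", "html.parser")
variant_b = BeautifulSoup("<div><p>No spec list in this layout</p></div>", "html.parser")

print(spec_value(variant_a, "Socket"))  # 1150
print(spec_value(variant_b, "Socket"))  # None
```

Returning None lets the caller decide whether to retry the request, fall back to another selector, or skip the product.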
FYI, here's the working code I have at the moment:
import re
from bs4 import BeautifulSoup
import requests
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
session.get('http://www.computerstore.nl', headers=headers)
response = session.get('http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html', headers=headers)
soup = BeautifulSoup(response.content)
print(soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True))