
BeautifulSoup sometimes gives exceptions

The strange thing is that the BeautifulSoup object sometimes gives the desired data, but other times I get an error such as "list index out of range" or "'NoneType' object has no attribute 'findNext'" when looking up data that is nested inside other elements.

This is the code:

import requests
from bs4 import BeautifulSoup

url = 'http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)

a = soup.find(text=('Socket')).find_next('dd').string

print(a)
Y_Lakdime asked Nov 10 '22 21:11



1 Answer

The actual problem is that the cell text is not always exactly Socket; sometimes it is surrounded by tabs or spaces. An exact text match then finds nothing, find() returns None, and the chained find_next() call fails. Instead of checking for an exact match, pass a compiled regular expression pattern:

import re

soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True)

This always prints 1150.
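To illustrate the difference, here is a minimal sketch on a made-up fragment. The markup is an assumption that only imitates a label padded with whitespace; it is not copied from the real page. It uses `string=`, which is the current name for the `text=` argument used above, and the stdlib `html.parser` so no extra parser is needed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the "Socket" label is padded with tabs and newlines.
PADDED = "<dl><dt>\n\t Socket \n</dt><dd>\n 1150 \n</dd></dl>"
soup = BeautifulSoup(PADDED, "html.parser")

# A plain string must equal the whole text node, padding included.
exact = soup.find(string="Socket")

# A compiled pattern is searched for inside the text node.
fuzzy = soup.find(string=re.compile("Socket"))

print(exact)  # None, so exact.find_next('dd') would raise AttributeError
value = fuzzy.find_next("dd").get_text(strip=True)
print(value)
```

With the exact match, the chained call fails on None, which is one way to end up with the NoneType error from the question.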


To explain the "sometimes" in the question (thanks to @carpetsmoker for the initial proposal in the comments):

If you open the page, then clear the cookies and refresh, you may see two different layouts of the same page.

The blocks on the page are arranged differently between the two. The same page therefore has two different layouts and two different HTML sources. What you are seeing is an A/B testing technique:

In marketing and business intelligence, A/B testing is jargon for a randomized experiment with two variants, A and B, which are the control and treatment in the controlled experiment. It is a form of statistical hypothesis testing with two variants leading to the technical term, Two-sample hypothesis testing, used in the field of statistics.

In other words, they are experimenting with the product page and gathering stats such as click rate and number of sales.
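Since either variant may be served, it can help to make the lookup fail with a clear message instead of an AttributeError on None. The sketch below is only an illustration: `find_spec_value` is a hypothetical helper, and both layouts are invented fragments, not the real page source.

```python
import re

from bs4 import BeautifulSoup


def find_spec_value(soup, label):
    """Return the stripped text of the first <dd> after a text node containing label.

    Hypothetical helper: raises LookupError when the label or the <dd> is missing.
    """
    node = soup.find(string=re.compile(re.escape(label)))
    if node is None:
        raise LookupError(f"label {label!r} not found in this page variant")
    dd = node.find_next("dd")
    if dd is None:
        raise LookupError(f"no <dd> element follows label {label!r}")
    return dd.get_text(strip=True)


# Two invented layouts standing in for the A and B variants.
LAYOUT_A = "<dl><dt>Socket</dt><dd>1150</dd></dl>"
LAYOUT_B = "<div><span>\n Socket\t</span><dl><dd>\n1150\n</dd></dl></div>"

results = [
    find_spec_value(BeautifulSoup(html, "html.parser"), "Socket")
    for html in (LAYOUT_A, LAYOUT_B)
]
print(results)
```

Catching LookupError at the call site makes it easy to log which request returned an unexpected variant.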


FYI, here's the working code I have at the moment:

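import re

from bs4 import BeautifulSoup
import requests

# Reuse one session so cookies set by the first request are sent with the second
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
session.get('http://www.computerstore.nl', headers=headers)

response = session.get('http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html', headers=headers)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, 'html.parser')
# Match any text node containing "Socket", then read the next <dd> with whitespace stripped
print(soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True))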
alecxe answered Nov 15 '22 13:11
