
Beautifulsoup Cannot FindAll

I'm trying to scrape nature.com to perform some analysis on journal articles. When I execute the following:

import requests
from bs4 import BeautifulSoup
import re

query = "http://www.nature.com/search?journal=nature&order=date_desc"

for page in range(1, 10):
    req = requests.get(query + "&page=" + str(page))
    soup = BeautifulSoup(req.text)
    cards = soup.findAll("li", "mb20 card cleared")
    matches = re.findall('mb20 card cleared', req.text)
    print(len(cards), len(matches))

I expect BeautifulSoup to print "25" (the number of search results) once for every page, but it doesn't. Instead, it prints:

14, 25
12, 25
25, 25
15, 25 
15, 25
17, 25
17, 25
15, 25
14, 25

Looking at the HTML source shows that 25 results are returned per page, but BeautifulSoup seems to be confused here and I can't figure out why.

Update 1: In case it matters, I'm running on Mac OS X Mavericks with Anaconda Python 2.7.10 and bs4 version 4.3.1.

Update 2: I added a regex to show that req.text does indeed contain what I'm looking for, but BeautifulSoup is not finding it.

Update 3: When I run this simple script multiple times, I sometimes get a "Segmentation fault: 11". Not sure why.

Asked Nov 10 '22 by slaw

1 Answer

The difference comes from the parser BeautifulSoup uses under the hood.

If you don't specify a parser explicitly, BeautifulSoup chooses the best one available, based on this ranking:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

Specify the parser explicitly:

soup = BeautifulSoup(data, 'html5lib')     # most lenient; parses the way a browser does
soup = BeautifulSoup(data, 'html.parser')  # Python's built-in parser, no extra install
soup = BeautifulSoup(data, 'lxml')         # fastest; requires the lxml package
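To see why the counts can differ between parsers, feed each one the same piece of invalid markup. The sketch below (adapted from the parser-differences example in the BeautifulSoup documentation) prints the tree each installed parser builds from a dangling `</p>`:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The same invalid fragment handed to each parser: html.parser keeps the
# fragment as-is, while lxml and html5lib rebuild a full document and
# each repairs the stray </p> in its own way.
fragment = "<a></p>"

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(fragment, parser)
    except FeatureNotFound:
        print(parser, "is not installed")
        continue
    print(parser, "->", soup.decode())
```

With all three installed, html.parser keeps just the bare `<a></a>` while the other two wrap it in a full `<html><body>` tree. On real-world pages with imperfect markup (like the search results here), that kind of divergence is exactly what changes a `findAll` count from one run or machine to the next.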
Answered Nov 15 '22 by alecxe