Speeding up beautifulsoup

I'm running a scraper against this course website, and I'm wondering whether there's a faster way to scrape the page once I have it loaded into BeautifulSoup. It takes much longer than I expected.

Tips?

from selenium import webdriver
from selenium.webdriver.support.ui import Select

from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.implicitly_wait(10)  # seconds
driver.get("https://acadinfo.wustl.edu/Courselistings/Semester/Search.aspx")
select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))

parsedClasses = {}

for i in range(len(select.options)):
    print(i)
    # The postback replaces the DOM, so re-locate the dropdown each iteration.
    select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))
    select.options[i].click()
    upperLevelClassButton = driver.find_element_by_id("Body_Level500")
    upperLevelClassButton.click()
    driver.find_element_by_name("ctl00$Body$ctl15").click()

    soup = BeautifulSoup(driver.page_source, "lxml")

    courses = soup.select(".CrsOpen")
    for course in courses:
        courseName = course.find_next(class_="ResultTable")["id"][13:]
        parsedClasses[courseName] = []
        print(courseName)
        for section in course.select(".SecOpen"):
            classInfo = section.find_all_next(class_="ItemRowCenter")
            parsedClasses[courseName].append((int(classInfo[0].string),
                                              int(classInfo[1].string),
                                              int(classInfo[2].string)))

print(parsedClasses)
print(parsedClasses['FL2014' + 'A46' + '3284'])

driver.quit()
2 Answers

I'm going to post this hidden gem in the hope that it helps someone as much as it helped me:

Just make sure you're passing a string object to BeautifulSoup, not bytes.

If you're using requests, do this:

page = requests.get(some_url)
soup = BeautifulSoup(page.text, 'html.parser')

instead of this:

page = requests.get(some_url)
soup = BeautifulSoup(page.content, 'html.parser')

I don't know the reason behind this, and the author of the referenced article doesn't either, but it made my code almost four times faster.

Speeding Up BeautifulSoup With Large XML Files, James Hodgkinson
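
If you want to verify the effect on your own pages, here's a minimal timing sketch (the URL is a placeholder, and the exact speedup will vary). One plausible explanation is that .content is raw bytes, so BeautifulSoup has to detect the encoding before parsing, while .text has already been decoded by requests:

import timeit

import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com")  # placeholder: use any reasonably large page

# .text is an already-decoded str; .content is raw bytes that bs4 must sniff an encoding for
t_str = timeit.timeit(lambda: BeautifulSoup(page.text, "html.parser"), number=50)
t_bytes = timeit.timeit(lambda: BeautifulSoup(page.content, "html.parser"), number=50)
print("str: %.3fs  bytes: %.3fs" % (t_str, t_bytes))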


According to the BeautifulSoup docs:

You can speed up encoding detection significantly by installing the cchardet library.

Assuming you are already using lxml as the parser for BeautifulSoup (which the OP is), you can speed it up significantly (10x: link) by just installing and importing cchardet.
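
For completeness, a minimal sketch of what that looks like (the URL is a placeholder): just having cchardet installed is enough, since BeautifulSoup's encoding detection (UnicodeDammit) picks it up automatically when available; the explicit import simply fails fast if it's missing:

# pip install cchardet lxml
import cchardet  # bs4 uses cchardet automatically for encoding detection when installed

import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com")  # placeholder URL

# Passing bytes here triggers encoding detection, which is exactly where cchardet helps
soup = BeautifulSoup(page.content, "lxml")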
