I'm running a scraper against this course website, and I'm wondering whether there's a faster way to scrape the page once I have it loaded into BeautifulSoup. It takes far longer than I expected.
Any tips?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.implicitly_wait(10)  # seconds
driver.get("https://acadinfo.wustl.edu/Courselistings/Semester/Search.aspx")
select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))

parsedClasses = {}

for i in range(len(select.options)):
    print i
    # Re-locate the dropdown: the page reloads after each search,
    # which invalidates the previous element reference.
    select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))
    select.options[i].click()
    upperLevelClassButton = driver.find_element_by_id("Body_Level500")
    upperLevelClassButton.click()
    driver.find_element_by_name("ctl00$Body$ctl15").click()

    soup = BeautifulSoup(driver.page_source, "lxml")
    courses = soup.select(".CrsOpen")
    for course in courses:
        courseName = course.find_next(class_="ResultTable")["id"][13:]
        parsedClasses[courseName] = []
        print courseName
        for section in course.select(".SecOpen"):
            classInfo = section.find_all_next(class_="ItemRowCenter")
            parsedClasses[courseName].append(
                (int(classInfo[0].string),
                 int(classInfo[1].string),
                 int(classInfo[2].string)))

print parsedClasses
print parsedClasses['FL2014' + 'A46' + '3284']

driver.quit()
I'm going to post this hidden gem in hopes that it might help someone, as it helped me a lot:
Just make sure you're passing a string object to BeautifulSoup, not bytes.
If you're using requests, do this
page = requests.get(some_url)
soup = BeautifulSoup(page.text, 'html.parser')
instead of this
page = requests.get(some_url)
soup = BeautifulSoup(page.content, 'html.parser')
I don't know the reason behind this, and the author of the referenced article doesn't either, but it sure made my code almost 4 times faster. (A likely explanation: when you pass bytes, BeautifulSoup has to run its own encoding detection before parsing, whereas `page.text` has already been decoded by requests.)
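A minimal sketch of the same idea without a live request — the raw bytes here stand in for `page.content`, and decoding them up front plays the role of `page.text`, letting BeautifulSoup skip its own encoding detection:

```python
from bs4 import BeautifulSoup

# Simulated raw HTTP body (what requests exposes as page.content).
raw = u"<html><body><p>hello</p></body></html>".encode("utf-8")

# Decode to str first (requests does this for you via page.text),
# so BeautifulSoup parses a string instead of sniffing the bytes.
soup = BeautifulSoup(raw.decode("utf-8"), "html.parser")
print(soup.p.string)
```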
Speeding Up BeautifulSoup With Large XML Files, James Hodgkinson
According to the BeautifulSoup docs:
"You can speed up encoding detection significantly by installing the cchardet library."
Assuming you are already using lxml as the parser for BeautifulSoup (which the OP is), you can speed it up significantly (10x, per the linked article) just by installing and importing cchardet.