I'm running a scraper against this course website, and I'm wondering whether there's a faster way to scrape the page once I have it loaded into BeautifulSoup. It takes far longer than I expected.
Any tips?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.implicitly_wait(10)  # seconds
driver.get("https://acadinfo.wustl.edu/Courselistings/Semester/Search.aspx")
select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))

parsedClasses = {}

for i in range(len(select.options)):
    print i
    # Re-locate the dropdown: the page reloads after each search,
    # which invalidates the previous element reference.
    select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))
    select.options[i].click()
    upperLevelClassButton = driver.find_element_by_id("Body_Level500")
    upperLevelClassButton.click()
    driver.find_element_by_name("ctl00$Body$ctl15").click()

    soup = BeautifulSoup(driver.page_source, "lxml")
    courses = soup.select(".CrsOpen")
    for course in courses:
        courseName = course.find_next(class_="ResultTable")["id"][13:]
        parsedClasses[courseName] = []
        print courseName
        for section in course.select(".SecOpen"):
            classInfo = section.find_all_next(class_="ItemRowCenter")
            parsedClasses[courseName].append(
                (int(classInfo[0].string),
                 int(classInfo[1].string),
                 int(classInfo[2].string)))

print parsedClasses
print parsedClasses['FL2014' + 'A46' + '3284']

driver.quit()
I'm going to post this hidden gem in hopes that it might help someone, as it helped me a lot:
Just make sure you're passing a string object to BeautifulSoup, not bytes.
If you're using requests, do this
page = requests.get(some_url)
soup = BeautifulSoup(page.text, 'html.parser')
instead of this
page = requests.get(some_url)
soup = BeautifulSoup(page.content, 'html.parser')
I don't know the reason behind this, and the author of the referenced article doesn't either, but it sure made my code almost 4 times faster. (A likely explanation: when you pass bytes, BeautifulSoup has to run its own encoding detection before parsing, whereas `page.text` has already been decoded by requests.)
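A minimal sketch of the same idea without a live request — the raw bytes here stand in for `page.content`, and decoding them up front plays the role of `page.text`, letting BeautifulSoup skip its own encoding detection:

```python
from bs4 import BeautifulSoup

# Simulated raw HTTP body (what requests exposes as page.content).
raw = u"<html><body><p>hello</p></body></html>".encode("utf-8")

# Decode to str first (requests does this for you via page.text),
# so BeautifulSoup parses a string instead of sniffing the bytes.
soup = BeautifulSoup(raw.decode("utf-8"), "html.parser")
print(soup.p.string)
```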
Speeding Up BeautifulSoup With Large XML Files, James Hodgkinson
According to the BeautifulSoup docs:
"You can speed up encoding detection significantly by installing the cchardet library."
Assuming you are already using lxml as the parser for BeautifulSoup (which the OP is), you can speed it up significantly (10x, per the linked article) just by installing and importing cchardet.