I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize, but then I saw that the website had a button that created content via JavaScript, so I decided to use Selenium.

Given that I can find elements and get their content using Selenium with methods like driver.find_element_by_xpath, what reason is there to use BeautifulSoup when I could just use Selenium for everything?

In this particular case, I need to use Selenium to click the JavaScript button, so is it better to use Selenium to parse as well, or should I use both Selenium and Beautiful Soup?
Selenium is an excellent automation tool and Scrapy is by far the most robust web-scraping framework. In terms of speed and efficiency, Scrapy is usually the better choice for plain scraping; when dealing with JavaScript-based websites where you need to trigger AJAX/PJAX requests, Selenium can work better.
Performance is one way to compare Selenium and BeautifulSoup. Selenium is effective, but driving a full browser makes it comparatively heavyweight; a BeautifulSoup-based pipeline can also be slow if pages are fetched one at a time, but that can be improved by multithreading the requests.
The combination of Beautiful Soup and Selenium will do the job for dynamic scraping: Selenium automates web-browser interaction from Python, so content rendered by JavaScript can be made available by automating the button clicks with Selenium and then extracted with Beautiful Soup, as in the sketch below.
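For instance, a minimal sketch of that division of labour might look like the following (the URL, button locator and item selector are placeholders, and it assumes Selenium 4's find_element/By API):

    # Use Selenium only for the part that needs a real browser (loading the page
    # and clicking the JavaScript button), then hand the rendered HTML to
    # Beautiful Soup for parsing. URL and selectors below are hypothetical.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com/page-with-js-button")  # placeholder URL
        driver.find_element(By.XPATH, "//button[@id='load-more']").click()  # placeholder locator

        # Wait until the JavaScript has actually injected the new content.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.item"))  # placeholder selector
        )

        # Parse the rendered page with Beautiful Soup rather than with Selenium.
        soup = BeautifulSoup(driver.page_source, "html.parser")
        for item in soup.select("div.item"):
            print(item.get_text(strip=True))
    finally:
        driver.quit()

The point of the split is that Selenium handles only the interaction it is uniquely able to do, while Beautiful Soup's cheaper and more convenient API does the extraction.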
Before answering your question directly, it's worth saying as a starting point: if all you need to do is pull content from static HTML pages, you should probably use an HTTP library (like Requests or the built-in urllib.request) with lxml or BeautifulSoup, not Selenium (although Selenium will probably be adequate too). The advantage of not using Selenium needlessly is that a plain HTTP library such as requests uses far less bandwidth, runs faster, and has fewer moving parts than driving a whole browser. Note that a site requiring cookies to function isn't a reason to break out Selenium - you can easily create a URL-opening function that magically sets and sends cookies with HTTP requests using cookielib/cookiejar (http.cookiejar in Python 3).
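As a rough sketch of that lighter-weight approach (the URL and CSS selector are placeholders), a requests Session keeps and re-sends cookies automatically while Beautiful Soup does the parsing:

    # Static-HTML scraping without Selenium: requests fetches the page (a Session
    # persists any cookies the site sets), Beautiful Soup parses it.
    # The URL and CSS selector below are made up for illustration.
    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()                       # cookies are stored and re-sent
    resp = session.get("https://example.com/listing")  # placeholder URL
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")     # or the faster "lxml" parser
    for link in soup.select("a.article-title"):        # placeholder selector
        print(link.get_text(strip=True), link.get("href"))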
Okay, so why might you consider using Selenium? Pretty much entirely to handle the case where the content you want to crawl is being added to the page via JavaScript, rather than baked into the HTML. Even then, you might be able to get the data you want without breaking out the heavy machinery: often the JavaScript is just fetching the data from a web API that you can call directly with an HTTP library, or the data is already sitting in the page source inside a <script> tag.
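For example, if the button simply fires an XHR to a JSON endpoint, you can usually find that request in the browser's developer tools (Network tab) and reproduce it directly. Everything in this sketch (endpoint, parameters, response shape) is invented for illustration:

    # Hypothetical example of calling the site's backing API directly instead of
    # automating a browser. The endpoint, parameters and JSON structure are
    # assumptions - copy the real ones from your browser's Network tab.
    import requests

    resp = requests.get(
        "https://example.com/api/items",                 # assumed endpoint
        params={"page": 2},                              # assumed query parameters
        headers={"X-Requested-With": "XMLHttpRequest"},  # some endpoints expect this
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):            # assumed response shape
        print(item.get("title"))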
If you do decide your situation merits using Selenium, use it in headless mode, which is supported by (at least) the Firefox and Chrome drivers. Web spidering doesn't ordinarily require actually graphically rendering the page, or using any browser-specific quirks or features, so a headless browser - with its lower CPU and memory cost and fewer moving parts to crash or hang - is ideal.
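A sketch of the headless setup for Chrome (Firefox has an equivalent Options class; the exact flag depends on your browser and Selenium versions):

    # Same Selenium code, but with no visible browser window.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # older Chrome versions use plain "--headless"
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com")   # placeholder URL
        print(driver.title)
    finally:
        driver.quit()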