
Set lxml as default BeautifulSoup parser

I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:

soup = bs4.BeautifulSoup(html, 'lxml') 

but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

Adam Hammes asked Jan 06 '15 00:01



2 Answers

According to the Specifying the parser to use documentation page:

The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed.

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

In other words, just installing lxml in the same Python environment makes it the default parser.

Note, though, that explicitly stating the parser is considered best practice. Parsers differ in ways that can produce subtle errors, which are difficult to debug if you let BeautifulSoup choose the parser by itself. You would also have to remember that lxml must be installed: if it is missing, you would not even notice, because BeautifulSoup would silently fall back to the next available parser without raising an error.

If you still don't want to specify the parser explicitly, at least leave a note in the project's README/documentation for your future self and for anyone else who uses your code, and list lxml in your project requirements alongside beautifulsoup4.
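One way to guard against the silent fallback described above is to check for lxml once at startup and fail loudly if it is missing. The following is only a minimal sketch using the standard library; the names PARSER, parser_available and require_parser are hypothetical helpers made up for illustration, not part of BeautifulSoup.

```python
import importlib.util

# Hypothetical project-wide constant naming the parser the project expects.
PARSER = "lxml"


def parser_available(module_name: str) -> bool:
    """Return True if the named module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def require_parser(module_name: str = PARSER) -> str:
    """Return module_name if it is installed, otherwise raise ImportError."""
    if not parser_available(module_name):
        raise ImportError(
            f"{module_name!r} is not installed; install it or pick another parser"
        )
    return module_name
```

Calling require_parser() once at the top of the program and passing its return value as the second argument to bs4.BeautifulSoup would keep the parser choice explicit while still defining it in a single place.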

Besides: "Explicit is better than implicit."

alecxe answered Oct 04 '22 01:10


Take a look at the accepted answer first; it is pretty good. As for this technicality:

but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

If I understood your question correctly, I can think of two approaches that will save you some keystrokes:

- Define a wrapper function, or
- Create a partial function.

# V1 - define a wrapper function - most straightforward.
import bs4

def bs_parse(html):
    return bs4.BeautifulSoup(html, 'lxml')

# ...
html = ...
bs_parse(html)

Or if you feel like showing off ...

import bs4
from functools import partial

bs_parse = partial(bs4.BeautifulSoup, features='lxml')

# ...
html = ...
bs_parse(html)
Leonid answered Oct 04 '22 03:10