
lxml / BeautifulSoup parser warning

Using Python 3, I'm trying to parse ugly HTML (which is not under my control) by using lxml with BeautifulSoup as explained here: http://lxml.de/elementsoup.html

Specifically, I want to work with the lxml API, but I need BeautifulSoup to do the parsing because, as I said, the HTML is ugly and lxml rejects it on its own.

The link above says: "All you need to do is pass it to the fromstring() function:"

from lxml.html.soupparser import fromstring
root = fromstring(tag_soup)

So that's what I'm doing:

import requests
from lxml.html.soupparser import fromstring

URL = 'http://some-place-on-the-internet.com'
html_goo = requests.get(URL).text
root = fromstring(html_goo)

It works in the sense that I can manipulate the HTML just fine after that. My problem is that every time I run the script, I receive this annoying warning:

/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

My problem is perhaps obvious: I'm not instantiating BeautifulSoup myself. I've tried adding the proposed parameter to the fromstring function, but that just gives me the error: TypeError: 'str' object is not callable. Searches online have proven fruitless so far.

I'd like to get rid of that warning message. Help appreciated, thanks in advance.

Teekin asked Apr 26 '18 14:04



2 Answers

For others who instantiate BeautifulSoup directly like this:

soup = BeautifulSoup(html_doc)

use this instead:

soup = BeautifulSoup(html_doc, 'html.parser')

Windsooon answered Nov 15 '22 19:11
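
For the soupparser case in the question, the parser name can be passed by keyword instead of positionally. The following is a minimal sketch, assuming the installed lxml forwards extra keyword arguments of fromstring() to the BeautifulSoup constructor (its signature is fromstring(data, beautifulsoup=None, makeelement=None, **bsargs)). The tag_soup string is a hypothetical stand-in for the downloaded page, and the warnings capture is only there to check that the parser warning is gone:

```python
import warnings

from lxml.html.soupparser import fromstring

# Hypothetical stand-in for the messy markup fetched in the question.
tag_soup = "<html><body><p>First paragraph</p><p>Second paragraph</body></html>"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Pass the parser name by keyword so it is forwarded to BeautifulSoup.
    # The second positional parameter of fromstring() is the BeautifulSoup
    # class itself, not the parser name.
    root = fromstring(tag_soup, features="html.parser")

# The result is an ordinary lxml element tree.
paragraphs = [p.text for p in root.iter("p")]

# Collect only the "no parser specified" warning, ignoring unrelated ones.
parser_warnings = [
    w for w in caught
    if "No parser was explicitly specified" in str(w.message)
]

print(paragraphs)
print(len(parser_warnings))
```

Writing fromstring(tag_soup, "html.parser") binds the string to the beautifulsoup parameter, which lxml then tries to call as a class. That is consistent with the TypeError: 'str' object is not callable reported in the question.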


When using BeautifulSoup, we usually write something like this:

[variable] = BeautifulSoup([contents you want to analyze])

Here is the problem:

If you have installed "lxml", BeautifulSoup will pick it automatically and notify you that it used it as the parser. It's not an error, just a warning.

So how do you get rid of it?

Name the parser explicitly:

[variable] = BeautifulSoup([contents you want to analyze], features="lxml")

(Based on BeautifulSoup 4.6.3, the latest version at the time of writing.)

Note that different versions of BeautifulSoup may use a different syntax for this argument, so read the warning message carefully.

Good luck!

Jaylin answered Nov 15 '22 18:11
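
The warning's remark that a different parser may "behave differently" can be seen by parsing the same fragment with two explicitly named parsers. This is a rough sketch, assuming both beautifulsoup4 and lxml are installed (features="lxml" fails if lxml is missing); html_doc is a hypothetical sample, not markup from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with unclosed tags and no <html>/<body> wrapper.
html_doc = "<p>Hello<b>world"

# Naming the parser makes the result reproducible and silences the warning.
soup_builtin = BeautifulSoup(html_doc, features="html.parser")
soup_lxml = BeautifulSoup(html_doc, features="lxml")

# Both parsers close the open tags, so the paragraph text matches.
print(soup_builtin.p.get_text())
print(soup_lxml.p.get_text())

# They differ in structure: lxml wraps the fragment in <html><body>,
# while html.parser leaves it as a bare fragment.
print(soup_builtin.find("body") is None)
print(soup_lxml.find("body") is None)
```

Whichever parser is chosen, stating it explicitly keeps the output the same across machines and virtual environments.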