 

lxml is not found within Beautiful Soup

I am trying to use beautifulsoup4 to parse a series of webpages written in XHTML. I am assuming that, for best results, I should pair it with an XML parser, and to my knowledge the only one supported by Beautiful Soup is lxml.

However, when I try to run the following, as per the Beautiful Soup documentation:

import requests
from bs4 import BeautifulSoup

r = requests.get('hereiswhereiputmyurl')
soup = BeautifulSoup(r.content, 'xml')

it results in the following error:

FeatureNotFound: Couldn't find a tree builder with the features you    
requested: xml. Do you need to install a parser library?

It's driving me crazy. I have found records of two other users who posted the same problem:

Here: How to re-install lxml?

and here: bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

I used this post (see the link directly below this line) to reinstall and update lxml, and I also updated Beautiful Soup, but I am still getting the error: Installing lxml, libxml2, libxslt on Windows 8.1

Beautiful Soup is otherwise working, because when I ran the following it presented me with its usual wall of markup: soup = BeautifulSoup(r.content, 'html.parser')

Here are my specs: Windows 8.1, Python 3.5.2. I use the Spyder IDE in Anaconda 3 to run my code (which, admittedly, I do not know much about).

I'm sure it's a mess-up that a beginner would make because, as I stated before, I have very little programming experience.

How can I resolve this issue? Or, if it is a known bug, would you recommend that I just use lxml by itself to scrape the data?

asked Jul 28 '16 by Kevin

2 Answers

This is a pretty old post, but I had this problem today and found the solution. You need to have lxml installed. Open a terminal and type:

pip3 install lxml

Now restart the dev environment (VS Code, Jupyter notebook or whatever) and it should work.
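
As a quick sanity check (a minimal sketch, not part of the original answer), you can confirm that Beautiful Soup can now find the lxml-backed parsers:

from bs4 import BeautifulSoup

# If lxml is installed correctly, neither call raises bs4.FeatureNotFound.
html_soup = BeautifulSoup('<p>hello</p>', 'lxml')          # lxml's HTML parser
xml_soup = BeautifulSoup('<root><item/></root>', 'xml')    # lxml's XML parser ('lxml-xml')
print(html_soup.p.text)        # hello
print(xml_soup.find('item'))   # <item/>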

answered Oct 14 '22 by Eeshaan


I think the problem is r.content. It gives the raw bytes of the response, which are not necessarily an HTML page; the body can be JSON, etc.
Try feeding r.text to soup.

soup = BeautifulSoup(r.text, 'lxml')

Better:

r.encoding = 'utf-8'

then

page = r.text

soup = BeautifulSoup(page, 'lxml')

If you are going to parse XML, you can use 'lxml-xml' as the parser.
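
Putting the suggestion together, a minimal sketch might look like this (the URL is the question's placeholder, and requests plus lxml are assumed to be installed):

import requests
from bs4 import BeautifulSoup

r = requests.get('hereiswhereiputmyurl')   # placeholder URL from the question
r.encoding = 'utf-8'                       # force the decoding used by r.text
page = r.text                              # decoded str instead of the raw bytes in r.content
soup = BeautifulSoup(page, 'lxml-xml')     # lxml's XML parser; use 'lxml' for HTML
print(soup.prettify()[:200])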

answered Oct 14 '22 by Kaan E.