I am trying to use beautifulsoup4 to parse a series of webpages written in XHTML. I am assuming that for best results I should pair it with an XML parser, and the only one supported by BeautifulSoup, to my knowledge, is lxml.
However, when I try to run the following as per the BeautifulSoup documentation:
import requests
from bs4 import BeautifulSoup
r = requests.get('hereiswhereiputmyurl')
soup = BeautifulSoup(r.content, 'xml')
it results in the following error:
FeatureNotFound: Couldn't find a tree builder with the features you
requested: xml. Do you need to install a parser library?
It's driving me crazy. I have found records of two other users who posted the same problem:
Here How to re-install lxml?
and Here bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
I used the following post to reinstall and update lxml, and I also updated Beautiful Soup, but I am still getting the error: Installing lxml, libxml2, libxslt on Windows 8.1
BeautifulSoup is working otherwise, because I ran the following code and it presented me with its usual wall of markup:
soup = BeautifulSoup(r.content, 'html.parser')
Here are my specs: Windows 8.1, Python 3.5.2. I use the Spyder IDE in Anaconda 3 to run my code (which, admittedly, I do not know much about).
I'm sure it's a mistake that a beginner would make because, as I stated before, I have very little programming experience.
How can I resolve this issue? Or, if it is a known bug, would you recommend that I just use lxml by itself to scrape the data?
This is a pretty old post, but I had this problem today and found the solution. You need to have lxml installed. Open the terminal and type
pip3 install lxml
Now restart the dev environment (VS Code, Jupyter notebook or whatever) and it should work.
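If the error is still there after installing, one possible cause is that pip installed lxml into a different Python than the one your IDE is running (this can happen with Anaconda/Spyder). Here is a minimal sketch to check that, using only the standard library. The helper name is_importable is made up for illustration.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if this interpreter can find the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Shows which Python is actually running your code.
print("Interpreter:", sys.executable)
# False here means lxml is not installed for THIS interpreter.
print("lxml importable:", is_importable("lxml"))
```

Run it inside the same environment where you get the error. If it prints False, install lxml for the interpreter shown on the first line.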
I think the problem is r.content. It gives the raw bytes of the response, which is not necessarily an HTML page; it can be JSON, etc.
Try feeding r.text, the decoded string, to the soup instead:
soup = BeautifulSoup(r.text, 'lxml')
Better, set the encoding explicitly first:
r.encoding = 'utf-8'
then
page = r.text
soup = BeautifulSoup(page, 'lxml')
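To show why the encoding matters, here is a small sketch that needs no network access. The bytes in raw stand in for r.content, and the hypothetical helper decode_body roughly mimics what happens when r.text decodes those bytes using r.encoding. The sample markup is made up.

```python
# Stand-in for r.content: the response body as raw UTF-8 bytes.
raw = "<p>café</p>".encode("utf-8")


def decode_body(body, encoding):
    """Decode raw bytes to text, replacing anything that cannot be decoded."""
    return body.decode(encoding, errors="replace")


good = decode_body(raw, "utf-8")    # correct encoding
bad = decode_body(raw, "latin-1")   # wrong guess gives garbled text

print(good)
print(bad)
```

With the wrong encoding the accented character comes out garbled, which is the kind of problem setting r.encoding before reading r.text is meant to avoid.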
If you are going to parse XML, you can use 'lxml-xml' as the parser.
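Note that 'lxml-xml' (like 'xml') still needs lxml to be installed. A minimal sketch, assuming bs4 is installed, that tries the XML parser first and falls back to the built-in html.parser when lxml is missing. The helper make_soup and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml-xml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is available."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


sample = "<root><item>1</item></root>"
soup, used = make_soup(sample)
print("parser used:", used)
print(soup.find("item").get_text())
```

The fallback is only a stopgap: html.parser treats the input as HTML, so for real XHTML/XML work it is still better to get lxml installed properly.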