How do I use Python and lxml to parse a local html file?

I am working with a local HTML file in Python, and I am trying to use lxml to parse it. For some reason I can't get the file to load properly, and I'm not sure whether this has to do with not having an HTTP server set up on my local machine, with my etree usage, or with something else.

My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/

This could be a related problem: Requests : No connection adapters were found for, error in Python3

Here is my code:

from lxml import html
import requests

page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)

test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')

print test

The traceback that I'm getting reads:

C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
  File "C:/Users/.../extract_html/extract.py", line 4, in <module>
    page = requests.get('C:\Users\...\sites\site_1.html')
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'

Process finished with exit code 1

You can see that it has something to do with a "connection adapter" but I'm not sure what that means.

rdevn00b asked Sep 24 '15 15:09


2 Answers

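If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server: by default it only has connection adapters for URLs that start with http:// or https://, and a Windows path matches neither, which is why you get the InvalidSchema error.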

from lxml import html

# requests is not needed for a local file; read it directly from disk.
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
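As a rough end-to-end sketch of this approach, assuming Python 3 with lxml installed: the sample markup, the temporary file, and the helper name `extract_strong_text` are all made up for illustration and stand in for the real site_1.html. The file is opened in binary mode here (a small departure from the snippet above) so that lxml, rather than the platform's default encoding, decides how to decode the document.

```python
import os
import tempfile

from lxml import html


def extract_strong_text(path):
    """Read a local HTML file and return the text of every <strong> inside a <p>."""
    # Binary mode: pass raw bytes to lxml and let it work out the encoding.
    with open(path, "rb") as f:
        page = f.read()
    tree = html.fromstring(page)
    return tree.xpath("//p/strong/text()")


# Hypothetical sample document standing in for the real local file.
sample = b"<html><body><form><p><strong>Hello</strong> world</p></form></body></html>"

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

try:
    result = extract_strong_text(sample_path)
    print(result)
finally:
    os.remove(sample_path)
```

The short relative XPath is only for the sample; the long absolute path from the question should work the same way once the file is read from disk, provided it matches the real document's structure.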
Bryan Oakley answered Oct 02 '22 16:10


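Another, arguably simpler, way is to use the parse function instead of fromstring. parse accepts a filename, so you don't have to open and read the file yourself. Note that it returns an ElementTree rather than a single element.

from lxml import html

# Raw string so the backslashes in the Windows path are not treated as escapes.
tree = html.parse(r"C:\Users\...site_1.html")
print(html.tostring(tree))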
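A small sketch of how the parsed result might then be queried, again assuming Python 3 with lxml installed. The sample markup and temporary file are hypothetical stand-ins for the real file. Since `html.parse` returns an ElementTree, `getroot()` is used to get the root `<html>` element before running the XPath query.

```python
import os
import tempfile

from lxml import html

# Hypothetical sample document standing in for the real local file.
sample = b"<html><body><p><strong>Hello</strong> world</p></body></html>"

with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

try:
    doc = html.parse(sample_path)   # ElementTree wrapping the whole document
    root = doc.getroot()            # the root <html> element
    root_tag = root.tag
    strong_text = root.xpath("//p/strong/text()")
    serialized = html.tostring(root)
finally:
    os.remove(sample_path)

print(root_tag)
print(strong_text)
print(serialized)
```

Whether to use `parse` or `open` plus `fromstring` is mostly a matter of taste for a single file; `parse` saves a step, while reading the file yourself gives you control over how it is opened.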
molhamaleh answered Oct 02 '22 15:10