How do I use Python and lxml to parse a local HTML file?

Tags: python, python-2.7

I am working with a local HTML file in Python and trying to parse it with lxml. The file won't load properly, and I'm not sure whether the cause is the lack of an HTTP server on my local machine, my etree usage, or something else.

My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/

This could be a related problem: "Requests: No connection adapters were found for" error in Python 3

Here is my code:

from lxml import html
import requests

page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)

test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')

print test

The traceback that I'm getting reads:

C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
  File "C:/Users/.../extract_html/extract.py", line 4, in <module>
    page = requests.get('C:\Users\...\sites\site_1.html')
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'

Process finished with exit code 1

You can see that it has something to do with a "connection adapter" but I'm not sure what that means.

asked Sep 24 '15 15:09

rdevn00b

2 Answers

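If the file is local, you shouldn't be using requests. requests expects to be talking to a web server, and the "connection adapter" error means it has no handler for a Windows path, which is neither an http:// nor an https:// URL. Just open the file and read it in: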

from lxml import html

with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()  # read the local file directly; no HTTP request needed
tree = html.fromstring(page)
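As for what the "connection adapter" message means: as far as I know, a requests session keeps a mapping from URL prefixes to transport adapters, and by default only http:// and https:// are registered. A Windows path starts with neither, so requests raises InvalidSchema. Here is a minimal sketch that shows this without touching the network. The path in it is a made-up placeholder, not your real one.

```python
import requests
from requests.exceptions import InvalidSchema

session = requests.Session()

# By default a session only has adapters for these two URL prefixes.
default_prefixes = sorted(session.adapters.keys())
print(default_prefixes)

# Hypothetical Windows path, used purely for illustration.
local_path = r'C:\Users\example\site_1.html'

try:
    # get_adapter() looks for a registered prefix that the URL starts with.
    session.get_adapter(local_path)
    adapter_found = True
except InvalidSchema as exc:
    adapter_found = False
    print(exc)
```

So the error is not about a missing local HTTP server or about etree; the string simply isn't a URL that requests knows how to fetch.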

answered Oct 02 '22 16:10

Bryan Oakley

A more direct way is to use lxml's parse function instead of fromstring. It accepts a file name, so you don't have to open and read the file yourself:

from lxml import html

tree = html.parse(r"C:\Users\...site_1.html")  # raw string keeps the backslashes literal
print(html.tostring(tree))
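To compare the two approaches side by side, here is a self-contained sketch. It writes a small made-up HTML file into a temporary directory so it can run anywhere; the file name, markup, and XPath are assumptions for illustration, standing in for your real file and query. One difference worth knowing: html.parse returns an ElementTree, while html.fromstring returns the root element. Both support .xpath().

```python
import os
import tempfile

from lxml import html

# Made-up sample markup standing in for the real site_1.html.
sample = "<html><body><p><strong>Hello</strong> world</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")
    with open(path, "w") as f:
        f.write(sample)

    # Approach 1: read the file yourself, then parse the string.
    with open(path, "r") as f:
        root = html.fromstring(f.read())
    from_string = root.xpath("//p/strong/text()")

    # Approach 2: let lxml open the file; parse() returns an ElementTree.
    tree = html.parse(path)
    from_parse = tree.xpath("//p/strong/text()")

print(from_string, from_parse)
```

If you need the root element from the parse result (for example to call element-only methods), tree.getroot() gives it to you.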

answered Oct 02 '22 15:10

molhamaleh
