I am working with a local html file in python, and I am trying to use lxml to parse the file. For some reason I can't get the file to load properly, and I'm not sure if this has to do with not having an http server set up on my local machine, etree usage, or something else.
My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/
This could be a related problem: Requests : No connection adapters were found for, error in Python3
Here is my code:
from lxml import html
import requests
page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)
test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')
print test
The traceback that I'm getting reads:
C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
File "C:/Users/.../extract_html/extract.py", line 4, in <module>
page = requests.get('C:\Users\...\sites\site_1.html')
File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
adapter = self.get_adapter(url=request.url)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'
Process finished with exit code 1
You can see that it has something to do with a "connection adapter" but I'm not sure what that means.
BeautifulSoup module in Python allows us to scrape data from local HTML files. For some reason, website pages might get stored in a local (offline environment), and whenever in need, there may be requirements to get the data from them.
Parsing name and text attributes of tagsUsing the name attribute of the tag to print its name and the text attribute to print its text along with the code of the tag- ul from the file.
open() to open an HTML file within Python. Call codecs. open(filename, mode, encoding) with filename as the name of the HTML file, mode as "r" , and encoding as "utf-8" to open an HTML file in read-only mode.
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
If the file is local, you shouldn't be using requests
-- just open the file and read it in. requests
expects to be talking to a web server.
with open(r'C:\Users\...site_1.html', "r") as f:
page = f.read()
tree = html.fromstring(page)
There is a better way for doing it:
using parse
function instead of fromstring
tree = html.parse("C:\Users\...site_1.html")
print(html.tostring(tree))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With