I've searched everywhere, and what I mostly found was doc.xpath('//element[@class="classname"]'), but this does not work no matter what I try.
Here's the code I'm using:

import lxml.html
from urllib.request import urlopen

def check():
    data = urlopen('url').read()
    return str(data)

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='test']")
print(el)
It simply prints an empty list.
Edit: How odd. I used Google as a test page and it works fine there, but it doesn't work on the page I was using (YouTube).
Here's the exact code I'm using.

import lxml.html
from urllib.request import urlopen
import sys

def check():
    data = urlopen('http://www.youtube.com/user/TopGear').read()  # TopGear as a test
    return data.decode('utf-8', 'ignore')

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='channel']")
print(el)
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
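For instance, here's a minimal sketch of both parsing styles using lxml.etree (the XML snippet is made up purely for illustration):

from io import BytesIO
from lxml import etree

xml = b"<root><item>a</item><item>b</item></root>"  # made-up sample XML

# One-step parsing: build the whole tree at once.
root = etree.fromstring(xml)

# Step-by-step, event-driven parsing with iterparse (XML only).
for event, element in etree.iterparse(BytesIO(xml), events=('end',)):
    if element.tag == 'item':
        print(element.text)  # prints 'a', then 'b'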
lxml.etree supports the simple path syntax of the find, findall and findtext methods on ElementTree and Element, as known from the original ElementTree library (ElementPath).
lxml is a Python library that makes it easy to handle XML and HTML files, and it can also be used for web scraping. There are plenty of off-the-shelf XML parsers out there, but developers sometimes prefer to write their own XML and HTML parsers for finer control over the results.
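As a quick sketch of the ElementPath-style methods mentioned above (again with a made-up XML snippet):

from lxml import etree

xml = b"<channels><channel><title>TopGear</title></channel><channel><title>Other</title></channel></channels>"

root = etree.fromstring(xml)
print(root.find('channel/title').text)   # first match: 'TopGear'
print(root.findtext('channel/title'))    # text of first match: 'TopGear'
print(len(root.findall('channel')))      # all matches: 2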
The TopGear page that you use for testing doesn't have any <div class="channel"> elements. But this works (for example):
el = doc.xpath("//div[@class='channel-title-container']")
Or this:
el = doc.xpath("//div[@class='a yb xr']")
To find <div> elements with a class attribute that contains the string channel, you could use:
el = doc.xpath("//div[contains(@class, 'channel')]")
You can use lxml.cssselect to simplify class and id selection: http://lxml.de/dev/cssselect.html
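For example, a rough sketch with cssselect (this assumes the cssselect package is installed; the id in the last line is just a hypothetical placeholder):

from lxml.cssselect import CSSSelector

sel = CSSSelector('div.channel')   # class selector, compiled once
el = sel(doc)                      # same result as doc.xpath(sel.path)

# lxml.html elements also provide a cssselect() shortcut:
el = doc.cssselect('div.channel')
other = doc.cssselect('#some-hypothetical-id')   # id selector (placeholder id)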