
Finding html element with class using lxml

I've searched everywhere, and the suggestion I found most often was doc.xpath('//element[@class="classname"]'), but it does not work no matter what I try.

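The code I'm using:

import lxml.html
from urllib.request import urlopen

def check():
    # Fetch the page and decode the bytes; str(data) would return the "b'...'" repr instead
    data = urlopen('url').read()
    return data.decode('utf-8', 'ignore')

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='test']")
print(el)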

It simply prints an empty list.

Edit: How odd. I used Google as a test page and it works fine there, but it doesn't work on the page I was actually using (YouTube).

Here's the exact code I'm using.

import lxml.html
from urllib.request import urlopen

def check():
    # TopGear channel as a test page
    data = urlopen('http://www.youtube.com/user/TopGear').read()
    return data.decode('utf-8', 'ignore')


doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='channel']")
print(el)

asked Nov 22 '11 12:11 by Vexx



2 Answers

The TopGear page you are testing against doesn't contain any <div class="channel"> elements, and @class='channel' only matches when the whole attribute value is exactly channel. These expressions do match, for example:

el = doc.xpath("//div[@class='channel-title-container']")

Or this:

el = doc.xpath("//div[@class='a yb xr']")

To find <div> elements whose class attribute contains the substring channel, you could use:

el = doc.xpath("//div[contains(@class, 'channel')]") 
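
As a rough illustration of how these three styles of class test differ, here is a minimal sketch. The HTML snippet and its class names are made up for the example and are not taken from the YouTube page; the concat/normalize-space expression is a common XPath 1.0 idiom for matching a whole class token, offered here as one option rather than the only way to do it.

```python
import lxml.html

# Hypothetical markup standing in for a real page; the class names are invented.
HTML = """
<html><body>
  <div class="channel">exact</div>
  <div class="channel featured">multi</div>
  <div class="channel-title-container">prefixed</div>
</body></html>
"""

doc = lxml.html.document_fromstring(HTML)

# Exact string comparison: matches only when the whole attribute value is "channel".
exact = doc.xpath("//div[@class='channel']")

# Substring test: also matches "channel-title-container".
substring = doc.xpath("//div[contains(@class, 'channel')]")

# Whole-token test: pad the normalized attribute with spaces, then look for " channel ".
token = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' channel ')]"
)

exact_texts = [el.text for el in exact]
substring_texts = [el.text for el in substring]
token_texts = [el.text for el in token]

print(exact_texts)
print(substring_texts)
print(token_texts)
```

The exact comparison misses the element with two classes, and the substring test picks up the unrelated channel-title-container element, so the token form is usually the closest match to what "has class channel" means in CSS.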

answered Sep 22 '22 16:09 by mzjn


You can use lxml.cssselect to simplify lookups by class and id: http://lxml.de/dev/cssselect.html

answered Sep 23 '22 16:09 by dmzkrsk