
Get all HTML elements using LXML

I am trying to parse a large div tag in my HTML document and need to get all its HTML and nested tags inside the div. My code:

innerTree = fromstring(str(response.text))
print("The tags inside the target div are")
print(innerTree.cssselect('div.story-body__inner'))

But it prints:

[<Element div at 0x66daed0>]

I want it to return all the HTML tags inside the div. How can I do this with lxml?

Mehdi asked Feb 01 '26 01:02



1 Answer

LXML is a great library. No need to use BeautifulSoup or any other. Here's how to get the extra information you seek:

# import lxml HTML parser and HTML output function
from __future__ import print_function
from lxml.html import fromstring
from lxml.etree import tostring as htmlstring

# test HTML for demonstration
raw_html = """
    <div class="story-body__inner">
        <p>Test para with <b>subtags</b></p>
        <blockquote>quote here</blockquote>
        <img src="...">
    </div>
"""

# parse the HTML into a tree structure
innerTree = fromstring(raw_html)

# find the divs you want
# first by finding all divs with the given CSS selector
divs = innerTree.cssselect('div.story-body__inner')

# but that returns a list, so grab the first of those
div0 = divs[0]

# print that div, and its full HTML representation
print(div0)
print(htmlstring(div0))

# now to find sub-items

print('\n-- etree nodes')
for e in div0.xpath(".//*"):
    print(e)

print('\n-- HTML tags')
for e in div0.xpath(".//*"):
    print(e.tag)

print('\n-- full HTML text')
for e in div0.xpath(".//*"):
    print(htmlstring(e))

Note that lxml functions like cssselect and xpath return lists of nodes, not single nodes. You have to index into those lists to get the included nodes--even if there is just one.
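Because a node query returns a list, indexing with [0] raises IndexError when nothing matches. A minimal sketch of a guard follows; the helper name first_or_none and the class name no-such-class are made up for illustration, and XPath is used here only so the snippet does not depend on the separate cssselect package.

```python
from lxml.html import fromstring

raw_html = '<div class="story-body__inner"><p>Test para</p></div>'
tree = fromstring(raw_html)

# Node-selecting XPath queries return a list, even for a single match
matches = tree.xpath('//div[@class="story-body__inner"]')
missing = tree.xpath('//div[@class="no-such-class"]')


def first_or_none(nodes):
    """Hypothetical helper: first node of a result list, or None if empty."""
    return nodes[0] if nodes else None


div0 = first_or_none(matches)
nothing = first_or_none(missing)

print(div0.tag)
print(nothing)
```

Whether to return None or raise on an empty result is a design choice; returning None just makes the "no match" case explicit at the call site.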

Getting all the sub-tags or sub-HTML can mean several things: getting the ElementTree nodes, getting the tag names, or getting the full HTML text of those nodes. This code demos all three, using an XPath query. Sometimes CSS selectors are more convenient, sometimes XPath. In this case, the XPath query .//* means "return all nodes with any tag name, at any depth, under the current node".
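To make the "at any depth" part concrete, here is a small sketch that contrasts .//* (all descendants) with ./* (direct children only), reusing the test HTML from the answer. The variable names are illustrative, not part of the original code.

```python
from lxml.html import fromstring

raw_html = """
<div class="story-body__inner">
    <p>Test para with <b>subtags</b></p>
    <blockquote>quote here</blockquote>
    <img src="...">
</div>
"""

div0 = fromstring(raw_html).xpath('//div[@class="story-body__inner"]')[0]

# .//* walks the whole subtree under div0, in document order
all_descendants = [e.tag for e in div0.xpath('.//*')]

# ./* stops at the first level, so the nested <b> is not included
direct_children = [e.tag for e in div0.xpath('./*')]

print(all_descendants)
print(direct_children)
```

If only the top-level children are needed, iterating over the element directly (for child in div0) should give the same nodes as ./*.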

The results of running this under Python 2 follow. (The same code runs fine under Python 3, though the output text is slightly different, because etree.tostring returns byte strings, not Unicode strings, under Python 3.)

<Element div at 0x106eac8e8>
<div class="story-body__inner">
        <p>Test para with <b>subtags</b></p>
        <blockquote>quote here</blockquote>
        <img src="..."/>
    </div>


-- etree nodes
<Element p at 0x106eac838>
<Element b at 0x106eac890>
<Element blockquote at 0x106eac940>
<Element img at 0x106eac998>

-- HTML tags
p
b
blockquote
img

-- full HTML text
<p>Test para with <b>subtags</b></p>
<b>subtags</b>
<blockquote>quote here</blockquote>  
<img src="..."/>
Jonathan Eunice answered Feb 03 '26 14:02
