Need Python lxml syntax help for parsing HTML

I am brand new to Python, and I need some help with the syntax for finding and iterating through HTML tags using lxml. Here are the use cases I am dealing with:

The HTML file is fairly well formed (but not perfect). It has multiple tables, one containing a set of search results and one each for a header and footer. Each result row contains a link to the search result detail.

  1. I need to find the middle table with the search result rows (this one I was able to figure out):

        self.mySearchTables = self.mySearchTree.findall(".//table")
        self.myResultRows = self.mySearchTables[1].findall(".//tr")
    
  2. I need to find the links contained in this table (this is where I'm getting stuck):

        for searchRow in self.myResultRows:
            searchLink = searchRow.findall(".//a")
    

    It doesn't seem to actually locate the link elements.

  3. I need the plain text of the link. I imagine it would be something like searchLink.text if I actually got the link elements in the first place (see the combined sketch after this list).
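
A minimal combined sketch of the three steps in the same find/findall style (this assumes, as described above, that the second table on the page holds the search results; the file name is a placeholder):

    from lxml.html import parse

    # Parse the HTML file; lxml.html tolerates imperfect markup
    mySearchTree = parse("results.html")          # placeholder file name

    # 1. The second table on the page holds the search results
    mySearchTables = mySearchTree.findall(".//table")
    myResultRows = mySearchTables[1].findall(".//tr")

    # 2. and 3. The links in each result row, and their plain text
    for searchRow in myResultRows:
        for searchLink in searchRow.findall(".//a"):
            print(searchLink.text, searchLink.get("href"))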

Finally, in the actual API reference for lxml, I wasn't able to find information on the find and findall calls. I gleaned these from bits of code I found on Google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?

asked Mar 02 '09 by Shaheeb Roshan


2 Answers

Okay, first, regarding parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the version of BeautifulSoup included with lxml. That way you will also reap the benefit of a nice XPath or CSS selector interface.

However, I personally prefer Ian Bicking's HTML parser included in lxml.

Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree.

Those two functions are fairly easy to use, but they support only a very limited subset of XPath. I recommend using either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.

Here are some examples, with an HTML string parsed like this:

from lxml.html import fromstring
mySearchTree = fromstring(your_input_string)
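
Since the question deals with an HTML file rather than a string, you could equally parse the file directly (a small variation; "results.html" is just a placeholder name):

from lxml.html import parse

# parse() accepts a file name, URL, or file-like object and returns an ElementTree
mySearchTree = parse("results.html").getroot()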

Using the cssselect() method, your program would look roughly like this:

# Find all 'a' elements inside 'tr' table rows with a CSS selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
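
If you only want the links from the results table rather than from the header and footer tables, one option (a sketch that assumes, as in the question, that the results live in the second table) is to select the tables first and then search inside the one you want:

# Assumes, as in the question, that the second table holds the search results
tables = mySearchTree.cssselect('table')
for a in tables[1].cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))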

The equivalent using the xpath() method would be:

# Find all 'a' elements inside 'tr' table rows with XPath
for a in mySearchTree.xpath('.//tr//a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
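
One caveat for the third use case: a.text only returns the text that appears before the link's first child element, so it can come back empty or truncated if the link contains nested markup. The text_content() method gathers all of the text inside the element; a small sketch:

# text_content() returns the element's text plus the text of all its descendants
for a in mySearchTree.xpath('.//tr//a'):
    print(a.text_content().strip(), a.get('href'))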

answered by Van Gale


Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.
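
For comparison, a minimal Beautiful Soup sketch; the import and method names below assume the current bs4 package, which postdates the original answer, and "results.html" is a placeholder file name:

from bs4 import BeautifulSoup

# "results.html" is a placeholder file name
with open("results.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# select() takes a CSS selector, much like lxml's cssselect()
for a in soup.select("tr a"):
    print(a.get_text(strip=True), a.get("href"))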

answered by zweiterlinde