
finding elements by attribute with lxml

I need to parse an XML file to extract some data. I only need the elements that have certain attributes. Here is an example document:

    <root>
        <articles>
            <article type="news">
                <content>some text</content>
            </article>
            <article type="info">
                <content>some text</content>
            </article>
            <article type="news">
                <content>some text</content>
            </article>
        </articles>
    </root>

Here I would like to get only the articles with the type "news". What's the most efficient and elegant way to do it with lxml?

I tried with the find method but it's not very nice:

    from lxml import etree

    f = etree.parse("myfile")
    root = f.getroot()
    articles = root.getchildren()[0]
    article_list = articles.findall('article')
    for article in article_list:
        if "type" in article.keys():
            if article.attrib['type'] == 'news':
                content = article.find('content')
                content = content.text
Jérôme Pigeot asked Feb 23 '11 15:02




2 Answers

You can use XPath, e.g. root.xpath("//article[@type='news']")

This XPath expression returns a list of all <article/> elements whose "type" attribute has the value "news". You can then iterate over the list or pass it on wherever you need it.
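As a minimal sketch of that iteration, the loop from the question can be reduced to one xpath() call plus a list comprehension. The sample text values ("first news" and so on) are changed from the question's document only so the result is easy to tell apart; they are assumptions for illustration.

```python
from lxml import etree

# Sample document modelled on the question; the text values are illustrative.
XML = """
<root>
    <articles>
        <article type="news"><content>first news</content></article>
        <article type="info"><content>some info</content></article>
        <article type="news"><content>second news</content></article>
    </articles>
</root>
"""

root = etree.fromstring(XML)

# Select every <article> element whose type attribute equals "news".
news_articles = root.xpath("//article[@type='news']")

# findtext() returns the text of the first matching child, or None if absent.
contents = [article.findtext("content") for article in news_articles]
print(contents)
```

Articles without a type attribute simply do not match the predicate, so the explicit "type" in article.keys() check from the question is no longer needed.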

To get just the text content, you can extend the xpath like so:

    from lxml import etree

    root = etree.fromstring("""
    <root>
        <articles>
            <article type="news">
                <content>some text</content>
            </article>
            <article type="info">
                <content>some text</content>
            </article>
            <article type="news">
                <content>some text</content>
            </article>
        </articles>
    </root>
    """)

    print(root.xpath("//article[@type='news']/content/text()"))

and this will output ['some text', 'some text']. Or if you just wanted the content elements, it would be "//article[@type='news']/content" -- and so on.
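If the attribute value is not fixed, one possible variation is to pass it in as an XPath variable instead of building the expression by string formatting. The sketch below returns the content elements rather than their text. The helper name contents_by_type, the variable name $kind, and the sample text values are assumptions made for illustration, not part of the original answer.

```python
from lxml import etree

# Sample document modelled on the question; the text values are illustrative.
XML = """
<root>
    <articles>
        <article type="news"><content>first news</content></article>
        <article type="info"><content>some info</content></article>
        <article type="news"><content>second news</content></article>
    </articles>
</root>
"""


def contents_by_type(root, article_type):
    """Return the <content> elements of every <article> with the given type.

    Hypothetical helper: $kind is an XPath variable that lxml binds from
    the keyword argument of the same name.
    """
    return root.xpath("//article[@type=$kind]/content", kind=article_type)


root = etree.fromstring(XML)
news = contents_by_type(root, "news")
info = contents_by_type(root, "info")
print([element.text for element in news])
print([element.text for element in info])
```

Binding the value as a variable avoids quoting problems when the value itself contains quote characters.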

Devin Jeanpierre answered Sep 28 '22 02:09



Just for reference, you can achieve the same result with findall:

    from lxml import etree

    root = etree.fromstring("""
    <root>
        <articles>
            <article type="news">
                <content>some text</content>
            </article>
            <article type="info">
                <content>some text</content>
            </article>
            <article type="news">
                <content>some text</content>
            </article>
        </articles>
    </root>
    """)

    articles = root.find("articles")
    article_list = articles.findall("article[@type='news']/content")
    for a in article_list:
        print(a.text)
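A small variation on the same idea, sketched under the assumption that searching the whole tree is acceptable: starting the path with ".//" lets findall look through all descendants of root, so the intermediate find("articles") step can be skipped. The sample text values are illustrative.

```python
from lxml import etree

# Sample document modelled on the question; the text values are illustrative.
XML = """
<root>
    <articles>
        <article type="news"><content>first news</content></article>
        <article type="info"><content>some info</content></article>
        <article type="news"><content>second news</content></article>
    </articles>
</root>
"""

root = etree.fromstring(XML)

# ".//" matches <article> at any depth below root, not only direct children.
news_contents = root.findall(".//article[@type='news']/content")
texts = [content.text for content in news_contents]
print(texts)
```

Note that findall only understands the limited ElementPath subset, so result types such as text() are available through xpath() but not here.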
Kjir answered Sep 28 '22 02:09
