Finding elements by attribute with lxml

Tags: python, find, attributes, lxml

I need to parse an XML file to extract some data. I only need the elements that have certain attributes. Here's an example document:

    <root>
        <articles>
            <article type="news">
                <content>some text</content>
            </article>
            <article type="info">
                <content>some text</content>
            </article>
            <article type="news">
                <content>some text</content>
            </article>
        </articles>
    </root>

Here I would like to get only the articles whose type is "news". What's the most efficient and elegant way to do it with lxml?

I tried using find and findall, but the result is not very elegant:

    from lxml import etree

    f = etree.parse("myfile")
    root = f.getroot()
    articles = root.getchildren()[0]
    article_list = articles.findall('article')
    for article in article_list:
        if "type" in article.keys():
            if article.attrib['type'] == 'news':
                content = article.find('content')
                content = content.text

asked Feb 23 '11 15:02

Jérôme Pigeot

2 Answers

You can use XPath, e.g. root.xpath("//article[@type='news']")

This XPath expression returns a list of all <article/> elements whose "type" attribute has the value "news". You can then iterate over the list, or pass it on to other code.
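As a minimal sketch of iterating over such a result (the content strings "first", "second" and "third" and the names XML, news_articles and news_texts are made up for illustration, so that the matches are easy to tell apart):

```python
from lxml import etree

# Sample document; the element texts are illustrative placeholders.
XML = """
<root>
    <articles>
        <article type="news"><content>first</content></article>
        <article type="info"><content>second</content></article>
        <article type="news"><content>third</content></article>
    </articles>
</root>
"""

root = etree.fromstring(XML)

# xpath() returns a plain Python list of the matching elements.
news_articles = root.xpath("//article[@type='news']")

# Each item is an ordinary element, so the usual element API applies:
# findtext() returns the text of the first matching child.
news_texts = [article.findtext("content") for article in news_articles]
print(news_texts)  # ['first', 'third']
```

Because the filtering happens inside the expression, no explicit check of each element's attributes is needed in the loop.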

To get just the text content, you can extend the XPath expression like so:

    from lxml import etree

    root = etree.fromstring("""
    <root>
        <articles>
            <article type="news">
                <content>some text</content>
            </article>
            <article type="info">
                <content>some text</content>
            </article>
            <article type="news">
                <content>some text</content>
            </article>
        </articles>
    </root>
    """)

    # text() selects the text nodes of each matching <content> element
    print(root.xpath("//article[@type='news']/content/text()"))

This outputs ['some text', 'some text']. If you want the content elements themselves rather than their text, use "//article[@type='news']/content" instead, and so on.
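If the attribute value is not fixed in advance, one option is lxml's support for XPath variables, which are bound from keyword arguments to xpath(). The following is only a sketch: the helper name contents_by_type, the variable name $wanted and the sample texts are assumptions chosen for illustration.

```python
from lxml import etree

# Sample document; the element texts are illustrative placeholders.
XML = """
<root>
    <articles>
        <article type="news"><content>first</content></article>
        <article type="info"><content>second</content></article>
        <article type="news"><content>third</content></article>
    </articles>
</root>
"""

root = etree.fromstring(XML)


def contents_by_type(tree_root, article_type):
    """Return the <content> texts of articles with the given type (hypothetical helper)."""
    # $wanted is an XPath variable; lxml fills it in from the keyword argument,
    # so the value never has to be spliced into the expression string.
    return tree_root.xpath(
        "//article[@type=$wanted]/content/text()", wanted=article_type
    )


news = contents_by_type(root, "news")
info = contents_by_type(root, "info")
print(news)  # ['first', 'third']
print(info)  # ['second']
```

Passing the value as a variable avoids quoting problems when it contains an apostrophe or a double quote.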

answered Sep 28 '22 02:09

Devin Jeanpierre

Just for reference, you can achieve the same result with findall:

    from lxml import etree

    root = etree.fromstring("""
    <root>
        <articles>
            <article type="news">
                <content>some text</content>
            </article>
            <article type="info">
                <content>some text</content>
            </article>
            <article type="news">
                <content>some text</content>
            </article>
        </articles>
    </root>
    """)

    articles = root.find("articles")
    # findall accepts a simple attribute predicate in its path
    article_list = articles.findall("article[@type='news']/content")
    for a in article_list:
        print(a.text)
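A possible variation, sketched under the assumption that the articles may sit at any depth: starting the path with ".//" searches all descendants, so the intermediate find("articles") step can be dropped. The same path syntax is also understood by the standard library's xml.etree.ElementTree, which the sketch falls back to if lxml is not installed. The sample texts and the names XML, contents and texts are illustrative.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback: the standard library understands the same findall path syntax.
    import xml.etree.ElementTree as etree

# Sample document; the element texts are illustrative placeholders.
XML = """
<root>
    <articles>
        <article type="news"><content>first</content></article>
        <article type="info"><content>second</content></article>
        <article type="news"><content>third</content></article>
    </articles>
</root>
"""

root = etree.fromstring(XML)

# ".//" matches <article> elements anywhere below root, and the
# [@type='news'] predicate keeps only those with the wanted attribute value.
contents = root.findall(".//article[@type='news']/content")
texts = [c.text for c in contents]
print(texts)  # ['first', 'third']
```

findall supports only a limited path syntax, so expressions such as text() still need xpath().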

answered Sep 28 '22 02:09

Kjir
