Using lxml to parse namespaced HTML?

This is driving me totally nuts; I've been struggling with it for many hours. Any help would be much appreciated.

I'm using PyQuery 1.2.9 (which is built on top of lxml) to scrape this URL. I just want to get a list of all the links in the .linkoutlist section.

This is my request in full:

import requests
from pyquery import PyQuery as pq

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
doc = pq(response.content)
links = doc('#maincontent .linkoutlist a')
print(links)

But that prints an empty list. If I use this query instead:

links = doc('#maincontent .linkoutlist')

Then I get back this HTML:

<div xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude" class="linkoutlist">
   <h4>Full Text Sources</h4>
   <ul>
      <li><a title="Full text at publisher's site" href="http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&amp;volume=19&amp;issue=3&amp;spage=125" ref="itool=Abstract&amp;PrId=3159&amp;uid=15107654&amp;db=pubmed&amp;log$=linkoutlink&amp;nlmid=8609061" target="_blank">Lippincott Williams &amp; Wilkins</a></li>
      <li><a href="http://ovidsp.ovid.com/ovidweb.cgi?T=JS&amp;PAGE=linkout&amp;SEARCH=15107654.ui" ref="itool=Abstract&amp;PrId=3682&amp;uid=15107654&amp;db=pubmed&amp;log$=linkoutlink&amp;nlmid=8609061" target="_blank">Ovid Technologies, Inc.</a></li>
   </ul>
   <h4>Other Literature Sources</h4>
   ...
</div>

So the parent selectors do return HTML with lots of <a> tags. This also appears to be valid HTML.

More experimenting reveals that lxml does not like the xmlns attribute on the opening div, for some reason.

How can I ignore this in lxml, and just parse it like regular HTML?

UPDATE: Trying ns_clean, still failing:

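    from io import BytesIO
    from lxml import etree
    from lxml.cssselect import CSSSelector

    parser = etree.XMLParser(ns_clean=True)
    tree = etree.parse(BytesIO(response.content), parser)  # response.content is bytes
    sel = CSSSelector('#maincontent .rprt_all a')
    print(sel(tree))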

asked Apr 10 '15 15:04

Richard

1 Answer

You need to handle namespaces, including the default (unprefixed) one.

Working solution:

from pyquery import PyQuery as pq
import requests


response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')

# 'test' is an arbitrary prefix bound to the page's default XHTML namespace
namespaces = {'xi': 'http://www.w3.org/2001/XInclude', 'test': 'http://www.w3.org/1999/xhtml'}
links = pq('#maincontent .linkoutlist test|a', response.content, namespaces=namespaces)
for link in links:
    print(link.attrib.get("title", "No title"))

Prints titles of all links matching the selector:

Full text at publisher's site
No title
Free resource
Free resource
Free resource
Free resource
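
Why the prefix is needed: when the page is parsed as XML, every element under xmlns="http://www.w3.org/1999/xhtml" carries that namespace as part of its tag name, so a bare "a" no longer matches. Here is a minimal offline sketch of that behaviour. It uses the standard library's xml.etree.ElementTree rather than lxml (lxml.etree follows the same {namespace}tag convention), and the snippet, the example.com URLs, and the prefix "x" are made up for illustration:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, trimmed-down stand-in for the markup shown in the question.
snippet = """
<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">
  <ul>
    <li><a title="Full text at publisher's site" href="http://example.com/1">One</a></li>
    <li><a href="http://example.com/2">Two</a></li>
  </ul>
</div>
"""

root = ET.fromstring(snippet)

# The default namespace becomes part of every tag name.
print(root.tag)

# A bare name only matches elements that are in no namespace at all.
plain = root.findall(".//a")

# Binding any prefix of your choosing to the XHTML namespace makes the match work.
prefixed = root.findall(".//x:a", {"x": XHTML_NS})

print(len(plain), len(prefixed))
for link in prefixed:
    print(link.attrib.get("title", "No title"))
```

The prefix name itself does not matter ("test" in the answer, "x" here); only the namespace URI it is bound to has to match the document.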

Or, just set the parser to "html" and forget about namespaces:

links = pq('#maincontent .linkoutlist a', response.content, parser="html")
for link in links:
    print(link.attrib.get("title", "No title"))
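
The reason this works is that an HTML parser has no notion of XML namespaces: tag names come through bare, and xmlns is just another attribute. Here is a rough illustration of that idea using the standard library's html.parser as a stand-in (it is not the parser lxml uses, and LinkCollector, the snippet, and the example.com URLs are hypothetical):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, title) for every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive bare ("a", "div"); xmlns is treated as a plain attribute.
        if tag == "a":
            attrs = dict(attrs)
            self.links.append((attrs.get("href"), attrs.get("title", "No title")))


# Hypothetical, trimmed-down stand-in for the markup shown in the question.
snippet = """
<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">
  <ul>
    <li><a title="Full text at publisher's site" href="http://example.com/1">One</a></li>
    <li><a href="http://example.com/2">Two</a></li>
  </ul>
</div>
"""

collector = LinkCollector()
collector.feed(snippet)
for href, title in collector.links:
    print(href, title)
```

Unlike the PyQuery call above, this sketch collects every link in the snippet rather than only those under .linkoutlist; it is meant to show the namespace-free view of the markup, not to replace the selector.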

answered Sep 30 '22 18:09

alecxe

Tags: python, html, html-parsing, lxml, pyquery
