This is driving me totally nuts; I've been struggling with it for many hours. Any help would be much appreciated.
I'm using PyQuery 1.2.9 (which is built on top of lxml) to scrape this URL. I just want to get a list of all the links in the .linkoutlist section.
This is my request in full:
import requests
from pyquery import PyQuery as pq
response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
doc = pq(response.content)
links = doc('#maincontent .linkoutlist a')
print(links)
But that returns an empty array. If I use this query instead:
links = doc('#maincontent .linkoutlist')
Then I get back this HTML:
<div xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude" class="linkoutlist">
<h4>Full Text Sources</h4>
<ul>
<li><a title="Full text at publisher's site" href="http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&volume=19&issue=3&spage=125" ref="itool=Abstract&PrId=3159&uid=15107654&db=pubmed&log$=linkoutlink&nlmid=8609061" target="_blank">Lippincott Williams & Wilkins</a></li>
<li><a href="http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=15107654.ui" ref="itool=Abstract&PrId=3682&uid=15107654&db=pubmed&log$=linkoutlink&nlmid=8609061" target="_blank">Ovid Technologies, Inc.</a></li>
</ul>
<h4>Other Literature Sources</h4>
...
</div>
So the parent selectors do return HTML with lots of <a> tags. This also appears to be valid HTML.
More experimenting reveals that lxml does not like the xmlns attribute on the opening div, for some reason.
How can I ignore this in lxml, and just parse it like regular HTML?
UPDATE: Trying ns_clean, still failing:
from io import BytesIO
from lxml import etree
from lxml.cssselect import CSSSelector

parser = etree.XMLParser(ns_clean=True)
tree = etree.parse(BytesIO(response.content), parser)
sel = CSSSelector('#maincontent .rprt_all a')
print(sel(tree))
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
It is not uncommon for lxml/libxml2 to parse and fix broken HTML better, but BeautifulSoup has superior support for encoding detection. Which parser works better depends very much on the input. The lxml documentation also notes that the downside of using the BeautifulSoup parser is that it is much slower than lxml's HTML parser.
You need to handle namespaces, including the default (unprefixed) one declared by xmlns on the div.
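To see why an unqualified selector matches nothing, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than PyQuery itself (lxml follows the same convention of storing the namespace in the tag name). The snippet, the example.org URLs and the link text are made-up stand-ins for the real page, and the prefix name "test" is arbitrary:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, well-formed stand-in for the PubMed fragment.
snippet = (
    '<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">'
    '<ul>'
    '<li><a title="Full text at publisher\'s site" '
    'href="http://example.org/a">Publisher</a></li>'
    '<li><a href="http://example.org/b">Ovid</a></li>'
    '</ul>'
    '</div>'
)

root = ET.fromstring(snippet)

# The default namespace becomes part of every tag name.
print(root.tag)

# An unqualified search looks for <a> in no namespace and finds nothing.
plain = root.findall(".//a")

# Mapping a prefix to the XHTML namespace lets the search match
# the namespaced <a> elements.
qualified = root.findall(".//test:a", {"test": XHTML_NS})

titles = [a.get("title", "No title") for a in qualified]
print(len(plain), len(qualified))
print(titles)
```

With an XML parser the elements are really named {http://www.w3.org/1999/xhtml}a, so a selector for a plain a element only matches once it is qualified with a prefix bound to that namespace.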
Working solution:
from pyquery import PyQuery as pq
import requests
response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
namespaces = {'xi': 'http://www.w3.org/2001/XInclude', 'test': 'http://www.w3.org/1999/xhtml'}
links = pq('#maincontent .linkoutlist test|a', response.content, namespaces=namespaces)
for link in links:
    print(link.attrib.get("title", "No title"))
Prints titles of all links matching the selector:
Full text at publisher's site
No title
Free resource
Free resource
Free resource
Free resource
Or, just set the parser to "html" and forget about namespaces:
links = pq('#maincontent .linkoutlist a', response.content, parser="html")
for link in links:
    print(link.attrib.get("title", "No title"))
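The reason the HTML route works is that an HTML parser treats xmlns as an ordinary attribute instead of a namespace declaration. As a rough illustration only, the sketch below uses Python 3's standard-library html.parser in place of lxml's HTML parser; the LinkCollector class, the snippet and the example.org URLs are all hypothetical:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the title of every <a> tag, ignoring namespaces entirely."""

    def __init__(self):
        super().__init__()
        self.div_attrs = {}
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # xmlns arrives here as just another attribute of the div.
            self.div_attrs = dict(attrs)
        elif tag == "a":
            # The tag name is plain "a", with no namespace attached.
            self.titles.append(dict(attrs).get("title", "No title"))


# Hypothetical stand-in for the PubMed fragment.
snippet = (
    '<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">'
    '<ul>'
    '<li><a title="Full text at publisher\'s site" '
    'href="http://example.org/a">Publisher</a></li>'
    '<li><a href="http://example.org/b">Ovid</a></li>'
    '</ul>'
    '</div>'
)

collector = LinkCollector()
collector.feed(snippet)
collector.close()

print(collector.div_attrs.get("xmlns"))
print(collector.titles)
```

Because the tag names carry no namespace, a plain selector such as '#maincontent .linkoutlist a' can match without any prefix mapping.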