I am writing a script to read a web page and build a database of links that match certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href> values from the HTML...
result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')
I want to provide an alternative lxml-based solution.
The solution uses the CSSSelector class provided by lxml.cssselect:
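# Python 2 code (urllib.urlopen and the print statement)
import urllib
import lxml.html
from lxml.cssselect import CSSSelector
# Download the page and parse it into an element tree
connection = urllib.urlopen('http://www.yourTargetURL/')
dom = lxml.html.fromstring(connection.read())
# Compile a CSS selector that matches every <a> element
selAnchor = CSSSelector('a')
foundElements = selAnchor(dom)
# Print the href attribute of each match (None if the anchor has no href)
print [e.get('href') for e in foundElements]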
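Since the question also asks about keeping only links that match certain criteria, here is a minimal sketch of that step. It assumes Python 3 and uses lxml's xpath() rather than cssselect. The helper name extract_links, its keep parameter, and the sample HTML are made up for illustration and are not part of the original answer.

```python
import lxml.html

def extract_links(html_text, base_url=None, keep=lambda href: True):
    """Return the href values of all <a> elements for which keep(href) is true.

    extract_links and keep are hypothetical names used only in this sketch.
    """
    dom = lxml.html.fromstring(html_text)
    if base_url is not None:
        # Rewrite relative links such as "/docs/a.pdf" into absolute URLs
        dom.make_links_absolute(base_url)
    # '//a/@href' selects the href attribute of every <a> that has one,
    # so anchors without an href are skipped instead of yielding None
    hrefs = dom.xpath('//a/@href')
    return [str(h) for h in hrefs if keep(h)]

# Inline sample document so the sketch runs without network access
sample = (
    '<html><body>'
    '<a href="/docs/a.pdf">A</a>'
    '<a name="x">no href</a>'
    '<a href="http://example.com/b.html">B</a>'
    '</body></html>'
)

# Example criterion: keep only links that point to PDF files
links = extract_links(sample, base_url='http://example.com/',
                      keep=lambda h: h.endswith('.pdf'))
print(links)
```

Passing a different keep function (for example one that tests the URL against a regular expression) changes the criterion without touching the parsing code. The returned list could then be written to whatever database the script uses.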