I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html.
How do I retrieve the page? My two main concerns are which functions to use and how to filter out useless links from the page.
Example using urllib and lxml.html:
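    # Python 3 form; in Python 2 the call was urllib.urlopen
    from urllib.request import urlopen
    from lxml import html

    url = "http://www.infolanka.com/miyuru_gee/art/art.html"
    # Download the page and parse it into an element tree
    page = html.fromstring(urlopen(url).read())
    # Visit every <a> element and print its text and its href attribute
    for link in page.xpath("//a"):
        print("Name", link.text, "URL", link.get("href"))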
Output (abridged, shown here as (name, URL) pairs):
[('Aathma Liyanage', 'athma.html'),
('Abewardhana Balasuriya', 'abewardhana.html'),
('Aelian Thilakeratne', 'aelian_thi.html'),
('Ahamed Mohideen', 'ahamed.html'),
]
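The example above lists every link on the page, so the second part of the question (filtering out useless links) still needs a rule. Below is a minimal sketch of one possible rule, not a tested recipe for this particular site. It assumes that artist links are same-directory .html files with visible text, which matches the sample output but is only a guess about the rest of the page. To keep it runnable without a network connection or lxml, it uses the standard library's html.parser on a made-up SAMPLE string; the names SAMPLE, looks_like_artist_link, LinkCollector and extract_artist_links are all hypothetical.

```python
from html.parser import HTMLParser

# Made-up markup standing in for the real page (assumption, not the site's HTML).
SAMPLE = """
<a href="../index.html">Home</a>
<a href="athma.html">Aathma Liyanage</a>
<a href="http://example.com/ad">Advert</a>
<a href="ahamed.html">Ahamed Mohideen</a>
<a name="top"></a>
"""


def looks_like_artist_link(href, text):
    """Assumed heuristic: a same-directory .html file with visible text."""
    if not href or not text:
        return False
    # Reject absolute URLs and links that leave the current directory.
    if "/" in href or ":" in href:
        return False
    return href.endswith(".html")


class LinkCollector(HTMLParser):
    """Collects (text, href) pairs for <a> elements that pass the filter."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_a = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Accumulate text only while inside an <a> element.
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            text = "".join(self._text).strip()
            if looks_like_artist_link(self._href, text):
                self.links.append((text, self._href))
            self._in_a = False


def extract_artist_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


artists = extract_artist_links(SAMPLE)
print(artists)
```

The same predicate can be applied inside the lxml loop by calling looks_like_artist_link(link.get("href"), link.text) before printing. If the real page uses a different link layout, the rule would need adjusting.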