Please bear with me. I am quite new at Python - but having a lot of fun. I am trying to code a web crawler that crawls through election results from the last referendum in Denmark. I have managed to extract all the relevant links from the main page. And now I want Python to follow each of the 92 links and gather 9 pieces of information from each of those pages. But I am so stuck. Hope you can give me a hint.
Here is my code:
import requests
import urllib2
from bs4 import BeautifulSoup
# This is the original url http://www.kmdvalg.dk/
soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())
my_list = []
all_links = soup.find_all("a")
for link in all_links:
link2 = link["href"]
my_list.append(link2)
for i in my_list[1:93]:
print i
# The output shows all the links that I would like to follow and gather information from. How do I do that?
Due to the built-in support for generating feed exports in multiple formats, as well as selecting and extracting data from various sources, the performance of Scrapy can be said to be faster than Beautiful Soup. Working with Beautiful Soup can speed up with the help of Multithreading process.
Here is my solution using lxml
. It's similar to BeautifulSoup
import lxml
from lxml import html
import requests
page = requests.get('http://www.kmdvalg.dk/main')
tree = html.fromstring(page.content)
my_list = tree.xpath('//div[@class="LetterGroup"]//a/@href') # grab all link
print 'Length of all links = ', len(my_list)
my_list
is a list consist of all links. And now you can use for loop to scrape information inside each page.
We can for loop through each links. Inside each page, you can extract information as example. This is only for the top table.
table_information = []
for t in my_list:
page_detail = requests.get(t)
tree = html.fromstring(page_detail.content)
table_key = tree.xpath('//td[@class="statusHeader"]/text()')
table_value = tree.xpath('//td[@class="statusText"]/text()') + tree.xpath('//td[@class="statusText"]/a/text()')
table_information.append(zip([t]*len(table_key), table_key, table_value))
For table below the page,
table_information_below = []
for t in my_list:
page_detail = requests.get(t)
tree = html.fromstring(page_detail.content)
l1 = tree.xpath('//tr[@class="tableRowPrimary"]/td[@class="StemmerNu"]/text()')
l2 = tree.xpath('//tr[@class="tableRowSecondary"]/td[@class="StemmerNu"]/text()')
table_information_below.append([t]+l1+l2)
Hope this help!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With