
Web crawler - following links

Please bear with me. I am quite new at Python - but having a lot of fun. I am trying to code a web crawler that crawls through election results from the last referendum in Denmark. I have managed to extract all the relevant links from the main page. And now I want Python to follow each of the 92 links and gather 9 pieces of information from each of those pages. But I am so stuck. Hope you can give me a hint.

Here is my code:

import requests
from bs4 import BeautifulSoup

# This is the original url http://www.kmdvalg.dk/

page = requests.get('http://www.kmdvalg.dk/')
soup = BeautifulSoup(page.content, 'html.parser')

my_list = []
all_links = soup.find_all("a")

for link in all_links:
    href = link.get("href")  # some <a> tags have no href attribute
    if href:
        my_list.append(href)

for i in my_list[1:93]:
    print(i)

# The output shows all the links that I would like to follow and gather information from. How do I do that?
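One thing to check before following the links: if the hrefs in the page are relative (e.g. `E77.htm` rather than a full URL), they need to be joined to the base URL before `requests.get` can fetch them. Here is a minimal sketch of that step using a made-up stand-in snippet (the anchor names and hrefs are placeholders, not the real kmdvalg.dk markup):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'http://www.kmdvalg.dk/'

# A small stand-in for the fetched front page; the real page has 92 such links.
html = '''
<a href="E77.htm">Result A</a>
<a href="F1234.htm">Result B</a>
<a href="#top">skip me</a>
'''

soup = BeautifulSoup(html, 'html.parser')
links = [urljoin(BASE_URL, a['href'])
         for a in soup.find_all('a')
         if a.get('href') and not a['href'].startswith('#')]
print(links)
```

Each absolute URL in `links` can then be fetched with `requests.get` and parsed the same way as the front page.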
Metods asked Feb 15 '16 21:02



1 Answer

Here is my solution using lxml; the approach is similar to BeautifulSoup.

from lxml import html
import requests

page = requests.get('http://www.kmdvalg.dk/main')
tree = html.fromstring(page.content)
my_list = tree.xpath('//div[@class="LetterGroup"]//a/@href')  # grab all links
print('Length of all links =', len(my_list))

my_list is a list of all the links. Now you can loop over it and scrape the information inside each page. For example, the following extracts the top table on each page:

table_information = []
for t in my_list:
    page_detail = requests.get(t)
    tree = html.fromstring(page_detail.content)
    table_key = tree.xpath('//td[@class="statusHeader"]/text()')
    table_value = tree.xpath('//td[@class="statusText"]/text()') + tree.xpath('//td[@class="statusText"]/a/text()')
    # zip returns an iterator in Python 3, so materialize it with list()
    table_information.append(list(zip([t]*len(table_key), table_key, table_value)))
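To see the shape that the zip pattern above produces, here is the same line run on made-up key/value strings (the Danish labels and values are placeholders, not real scraped data):

```python
# Illustration of the zip pattern with made-up data
t = 'http://www.kmdvalg.dk/E77.htm'
table_key = ['Opdateret', 'Status']
table_value = ['15-02-2016', 'Afsluttet']
rows = list(zip([t] * len(table_key), table_key, table_value))
print(rows)
```

Each page thus contributes a list of (url, key, value) triples, which keeps track of which page every value came from.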

For the table at the bottom of each page:

table_information_below = []
for t in my_list:
    page_detail = requests.get(t)
    tree = html.fromstring(page_detail.content)
    l1 = tree.xpath('//tr[@class="tableRowPrimary"]/td[@class="StemmerNu"]/text()')
    l2 = tree.xpath('//tr[@class="tableRowSecondary"]/td[@class="StemmerNu"]/text()')
    table_information_below.append([t]+l1+l2)
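Once table_information_below is filled, you will probably want to save it; a minimal sketch using the csv module, with hypothetical rows in the same shape ([url, value1, value2, ...]) and made-up column names:

```python
import csv

# Hypothetical rows in the same shape as table_information_below
table_information_below = [
    ['http://www.kmdvalg.dk/E77.htm', '1.234', '567'],
    ['http://www.kmdvalg.dk/F1234.htm', '2.345', '678'],
]

with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'stemmer_primary', 'stemmer_secondary'])  # header names are placeholders
    writer.writerows(table_information_below)
```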

Hope this helps!

titipata answered Sep 24 '22 18:09