Using multiple web pages in a web scraper

I've been working on some Python code to get links to social media accounts from government websites, for research into how easily municipalities can be contacted. I've managed to adapt some code to work in Python 2.7, which prints all links to Facebook, Twitter, LinkedIn and Google+ present on a given input website. The issue I'm currently experiencing is that I'm not looking for links on just one web page, but on a list of about 200 websites that I have in an Excel file. I have no experience with importing these sorts of lists into Python, so I was wondering if anybody could take a look at the code and suggest a proper way to set all these web pages as the base_url, if possible.

import cookielib
import mechanize

base_url = "http://www.amsterdam.nl"

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
              'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(base_url, timeout=10)

links = {}
for link in br.links():
    if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
        links[link.url] = {'count': 1, 'texts': [link.text]}

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
Stefan asked Jan 11 '16



1 Answer

You mentioned that you have an Excel file with the list of all the websites, right? You can export the Excel file as a CSV file, which you can then read values from in your Python code.

Here's some more information regarding that.

Here's how to work directly with Excel files.
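If you'd rather skip the CSV export and read the Excel file directly, here's a minimal sketch using the xlrd library (this assumes the URLs sit in the first column of the first sheet; the filename 'urls.xls' is just a placeholder):

import xlrd

# Placeholder filename; assumes URLs are in the first column of the first sheet
book = xlrd.open_workbook('urls.xls')
sheet = book.sheet_by_index(0)
links = [sheet.cell_value(row, 0) for row in range(sheet.nrows)]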

You can do something along these lines:

import csv

links = []

with open('urls.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    # Simple example where only a single column of URLs is present;
    # take the first cell of each row so we end up with a flat list of strings
    links = [row[0] for row in csv_reader if row]

Now links is a list of all the URLs. You can then loop over the list inside a function which fetches the page and scrapes the data.

def extract_social_links(links):
    for url in links:
        base_url = url

        br = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        br.set_cookiejar(cj)
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.set_handle_redirect(True)
        br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
        br.addheaders = [('User-agent',
                          'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
        page = br.open(base_url, timeout=10)

        # Use a separate name so we don't shadow the 'links' parameter
        social_links = {}
        for link in br.links():
            if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
                social_links[link.url] = {'count': 1, 'texts': [link.text]}

        # printing
        for link_url, data in social_links.iteritems():
            print "%s - %s - %s - %d" % (base_url, link_url, ",".join(data['texts']), data['count'])
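Putting the two pieces together, you'd then call the function with the list read from the CSV:

extract_social_links(links)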

As an aside, you should probably split your if conditions to make them more readable.
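For example (purely illustrative), the four find() calls collapse nicely into a single any() over a tuple of keywords:

keywords = ('facebook', 'twitter', 'linkedin', 'plus.google')
if any(k in link.url for k in keywords):
    social_links[link.url] = {'count': 1, 'texts': [link.text]}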

Bhargav answered Oct 05 '22