
Write a Python script that goes through the links on a page recursively

Tags:

python

I'm doing a project for my school in which I would like to compare scam e-mails. I found this website: http://www.419scam.org/emails/ Now what I would like to do is save every scam in a separate document so that later on I can analyse them. Here is my code so far:

import BeautifulSoup, urllib2

address = 'http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()

# Save the raw HTML of the index page to a file
f = open('test.txt', 'wb')
f.write(html)
f.close()

This saves the whole HTML file as text. Now I would like to parse the file and follow the HTML links to the scams:

<a href="2011-12/01/index.htm">01</a> 
<a href="2011-12/02/index.htm">02</a> 
<a href="2011-12/03/index.htm">03</a>

etc.

If I get that, I would still need to go a step further and open and save another href. Any idea how I can do it all in one Python script?

Thank you!

01000001 asked Dec 09 '22 23:12

1 Answer

You picked the right tool in BeautifulSoup. Technically you could do it all in one script, but you might want to segment it, because it looks like you'll be dealing with tens of thousands of e-mails, all of which are separate requests - and that will take a while.

The BeautifulSoup documentation is going to help you a lot, but here's just a little code snippet to get you started. This gets all of the 'a' tags that link to the index pages for the e-mails, extracts their href attributes, and prepends the base URL so the pages can be accessed directly.

from bs4 import BeautifulSoup
import re
import urllib2

soup = BeautifulSoup(urllib2.urlopen("http://www.419scam.org/emails/"))

# Find all 'a' tags whose href matches the monthly index page pattern
tags = soup.find_all(href=re.compile(r"20......../index\.htm"))

links = []
for t in tags:
    links.append("http://www.419scam.org/emails/" + t['href'])

're' is Python's regular expressions module. In the `find_all` call, I told BeautifulSoup to find all the tags in the soup whose href attribute matches that regular expression. I chose this regular expression to get only the e-mail index pages rather than all of the href links on the page; I noticed that the index page links all share that pattern in their URLs.
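To see concretely what that pattern selects, here's a quick check (the example hrefs are taken from the page snippet above):

```python
import re

# The pattern from the snippet: '20', eight arbitrary characters
# (e.g. '11-12/01'), then '/index.htm'
pattern = re.compile(r"20......../index\.htm")

print(bool(pattern.search("2011-12/01/index.htm")))  # True: a monthly index link
print(bool(pattern.search("faq.htm")))               # False: an unrelated link
```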

Having all the proper 'a' tags, I then looped through them, extracting the string from the href attribute with t['href'] and prepending the rest of the URL to get complete, usable URLs.

Reading through that documentation, you should get an idea of how to expand these techniques to grab the individual e-mails.
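As a sketch of that next step (the helper names and the href regex here are my own, not from the original answer): given the HTML of one monthly index page, you can pull out the per-e-mail links the same way, then save each page to its own file.

```python
import re

def email_links(index_html, base_url):
    # Extract every relative .htm href from one index page and turn it
    # into an absolute URL by prepending the page's base URL.
    return [base_url + href
            for href in re.findall(r'href="([^"]+\.htm)"', index_html)]

def save_page(url, filename):
    # Fetch one page and write it to disk, one file per scam
    # (urllib2 as in the rest of this answer; Python 2).
    import urllib2
    html = urllib2.urlopen(url).read()
    with open(filename, 'wb') as f:
        f.write(html)
```

Looping `email_links` over the `links` list from the snippet above and calling `save_page` on each result gives you one file per scam; a short `time.sleep` between requests would be polite given how many there are.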

Paul Whalen answered May 23 '23 19:05