
Write a Python script that goes through the links on a page recursively

Tags:

python

I'm doing a project for my school in which I would like to compare scam e-mails. I found this website: http://www.419scam.org/emails/ Now what I would like to do is save every scam in a separate document so that later on I can analyse them. Here is my code so far:

import BeautifulSoup, urllib2

address = 'http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()

# Save the raw HTML of the index page to a file
f = open('test.txt', 'wb')
f.write(html)
f.close()

This saves the whole HTML file as text. Now I would like to parse the file and follow the HTML links to the scams:

<a href="2011-12/01/index.htm">01</a> 
<a href="2011-12/02/index.htm">02</a> 
<a href="2011-12/03/index.htm">03</a>

etc.

If I get that, I would still need to go a step further and open and save another href. Any idea how I can do it all in one Python script?

Thank you!

01000001 asked Dec 09 '22 23:12

1 Answer

You picked the right tool in BeautifulSoup. Technically you could do it all in one script, but you might want to segment it, because it looks like you'll be dealing with tens of thousands of e-mails, all of which are separate requests - and that will take a while.

The BeautifulSoup documentation is going to help you a lot, but here's just a little code snippet to get you started. This gets all of the 'a' tags that link to the index pages for the e-mails, extracts their href attributes, and prepends the base URL so the pages can be accessed directly.

from bs4 import BeautifulSoup
import re
import urllib2

soup = BeautifulSoup(urllib2.urlopen("http://www.419scam.org/emails/"))

# Find all 'a' tags whose href matches the monthly index page pattern
tags = soup.find_all(href=re.compile(r"20......../index\.htm"))

links = []
for t in tags:
    links.append("http://www.419scam.org/emails/" + t['href'])

're' is Python's regular expressions module. In the `find_all` call, I told BeautifulSoup to find all the tags in the soup whose href attribute matches that regular expression. I chose this regular expression to get only the e-mail index pages rather than all of the href links on the page; I noticed that the index page links all share that pattern in their URLs.
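To see concretely what that pattern selects, here's a quick check (the example hrefs are taken from the page snippet above):

```python
import re

# The pattern from the snippet: '20', eight arbitrary characters
# (e.g. '11-12/01'), then '/index.htm'
pattern = re.compile(r"20......../index\.htm")

print(bool(pattern.search("2011-12/01/index.htm")))  # True: a monthly index link
print(bool(pattern.search("faq.htm")))               # False: an unrelated link
```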

Having all the proper 'a' tags, I then looped through them, extracting the string from the href attribute with t['href'] and prepending the rest of the URL to get complete, usable URLs.

Reading through that documentation, you should get an idea of how to expand these techniques to grab the individual e-mails.
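As a sketch of that next step (the helper names and the href regex here are my own, not from the original answer): given the HTML of one monthly index page, you can pull out the per-e-mail links the same way, then save each page to its own file.

```python
import re

def email_links(index_html, base_url):
    # Extract every relative .htm href from one index page and turn it
    # into an absolute URL by prepending the page's base URL.
    return [base_url + href
            for href in re.findall(r'href="([^"]+\.htm)"', index_html)]

def save_page(url, filename):
    # Fetch one page and write it to disk, one file per scam
    # (urllib2 as in the rest of this answer; Python 2).
    import urllib2
    html = urllib2.urlopen(url).read()
    with open(filename, 'wb') as f:
        f.write(html)
```

Looping `email_links` over the `links` list from the snippet above and calling `save_page` on each result gives you one file per scam; a short `time.sleep` between requests would be polite given how many there are.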

Paul Whalen answered May 23 '23 19:05