Open source project for downloading mailing list archives preferably in Python

Are there any open source projects (preferably in Python) that can download (crawl?) the mailing list archives of open source projects such as Lucene or Hadoop (for example, http://mail-archives.apache.org/mod_mbox/lucene-java-user/)? I am specifically looking for a crawler/downloader tailored to (Apache) mailing list archives, not a generic crawler such as Scrapy. Any pointers are highly appreciated. Thank you.

prashu asked Dec 27 '22 15:12


1 Answer

There are usually facilities for downloading mbox files. With the link you provided, for example, you can append the mbox name and get the mail archive directly. Here is the mbox for October 2012:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/201210.mbox

So getting the archives programmatically is pretty straightforward. Once you have them:

import mailbox

# Path to a downloaded archive, e.g. "201210.mbox"
mails = mailbox.mbox("filename.mbox")
for message in mails:
    print(message["subject"])
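
To make the "getting the archives programmatically" part concrete, here is a minimal sketch that builds the monthly URL, downloads the file, and reads the subjects. It assumes the archive still serves monthly files named YYYYMM.mbox under the list URL, as in the October 2012 example; the helper names (mbox_url, download_mbox, subjects) are hypothetical and not part of any existing project.

```python
import mailbox
import urllib.request
from pathlib import Path

# Base URL taken from the question; everything else here is illustrative.
BASE_URL = "http://mail-archives.apache.org/mod_mbox/lucene-java-user/"


def mbox_url(year, month, base_url=BASE_URL):
    """Build the URL of one monthly archive, e.g. .../201210.mbox."""
    return f"{base_url.rstrip('/')}/{year:04d}{month:02d}.mbox"


def download_mbox(year, month, dest_dir=".", base_url=BASE_URL):
    """Download one monthly mbox file and return its local path."""
    url = mbox_url(year, month, base_url)
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url, timeout=30) as response:
        dest.write_bytes(response.read())
    return dest


def subjects(mbox_path):
    """Yield the Subject header of every message in a local mbox file."""
    for message in mailbox.mbox(str(mbox_path)):
        yield message["subject"]
```

Calling download_mbox(2012, 10) should save 201210.mbox in the current directory, and subjects() can then iterate over it. Looping over a range of years and months would fetch a whole list; a short pause between requests is polite to the server.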
JosefAssad answered Dec 31 '22 12:12