I am interested in knowing if there are any open source projects (preferably in Python) that can be used to download (crawl?) the mailing list archives of open source projects such as Lucene/Hadoop (for example http://mail-archives.apache.org/mod_mbox/lucene-java-user/). I am especially looking for a crawler/downloader customized for (Apache) mailing list archives, not a generic crawler such as Scrapy. Any pointers are highly appreciated. Thank you.
There are usually facilities for downloading mbox files. For the link you provided, you can append the month's mbox name to the list URL and get that month's archive directly. For example, the mbox for October 2012:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201210.mbox
So getting the archives programmatically is pretty straightforward. Once you have them:
import mailbox
# Path to a downloaded archive, e.g. the October 2012 file from the URL above
mails = mailbox.mbox('201210.mbox')
for message in mails:
    print(message['subject'])
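The download step itself can be sketched as follows. This is a minimal sketch, assuming every month follows the same `YYYYMM.mbox` naming seen in the October 2012 example and that the server still serves that layout; the helper names `month_keys`, `mbox_url`, and `download_archives` are hypothetical and not part of any library.

```python
import os
import urllib.request

# Base URL taken from the question; other lists are assumed to use the same layout.
BASE_URL = "http://mail-archives.apache.org/mod_mbox/lucene-java-user/"


def month_keys(start_year, start_month, end_year, end_month):
    """Yield 'YYYYMM' strings for every month in the inclusive range."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        yield "%04d%02d" % (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def mbox_url(key, base_url=BASE_URL):
    """Build the archive URL for one month, e.g. .../201210.mbox."""
    return base_url.rstrip("/") + "/" + key + ".mbox"


def download_archives(keys, dest_dir=".", base_url=BASE_URL):
    """Download each monthly mbox into dest_dir and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for key in keys:
        path = os.path.join(dest_dir, key + ".mbox")
        urllib.request.urlretrieve(mbox_url(key, base_url), path)
        paths.append(path)
    return paths
```

Calling `download_archives(month_keys(2012, 1, 2012, 12), "archives")` would then fetch the twelve monthly files for 2012, and each returned path can be passed to `mailbox.mbox` as shown above.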