Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of Usenet movie reviews, which hadn't occurred to me and works very well. For this particular program, technical Usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, chat transcripts, or anything that has proven useful to others would be very helpful. A partial or downloadable research corpus that isn't too marked up, a heuristic for finding an appropriate subset of Wikipedia articles, or any other idea would also be much appreciated.
(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)
UPDATE: User S0rin points out that Wikipedia asks not to be crawled and provides this export tool instead. Project Gutenberg's policy is specified here; the bottom line is to avoid crawling, but if you must: "Configure your robot to wait at least 2 seconds between requests."
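For what it's worth, the "slow script" I mentioned is nothing more elaborate than a loop with a delay. A minimal sketch (the URLs and file names here are placeholders, not the actual sources I'm using) that respects the 2-second guideline:

```python
import time
import urllib.request

# Placeholder URLs -- substitute whatever plain-text documents you have permission to fetch.
urls = [
    "http://example.org/corpus/doc1.txt",
    "http://example.org/corpus/doc2.txt",
]

for i, url in enumerate(urls):
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with open("doc_%03d.txt" % i, "w", encoding="utf-8") as out:
        out.write(text)
    # Honor the "wait at least 2 seconds between requests" guideline.
    time.sleep(2)
```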
UPDATE 2: The Wikipedia dumps are the way to go; thanks to the answerers who pointed them out. I ended up using the English version from http://download.wikimedia.org/enwiki/20090306/ and a Spanish dump about half the size. They take some work to clean up, but it's well worth it, and they contain a lot of useful data in the links.
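In case it helps anyone else, the cleanup I did amounts to streaming the XML and stripping the most common wiki markup. This is a rough sketch, not my exact script: the dump file name is an example, and the regexes are crude (they won't handle nested templates), but they're good enough for word-frequency work:

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Example file name -- point this at the pages-articles dump you downloaded.
DUMP = "enwiki-20090306-pages-articles.xml.bz2"

def plain_text(wikitext):
    """Very crude wiki-markup stripper; fine for word statistics."""
    text = re.sub(r"\{\{.*?\}\}", "", wikitext, flags=re.DOTALL)   # drop {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"<[^>]+>", "", text)                            # drop HTML tags
    text = re.sub(r"'{2,}", "", text)                              # drop bold/italic quotes
    return text

with bz2.open(DUMP, "rt", encoding="utf-8") as f:
    for event, elem in ET.iterparse(f):
        # Page bodies live in namespaced <text> elements.
        if elem.tag.endswith("}text") and elem.text:
            print(plain_text(elem.text)[:200])
        elem.clear()  # keep memory bounded on a multi-GB dump
```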
You can always collect your own text, but be warned: HTML pages often need heavy cleanup, so you're better off restricting yourself to RSS feeds.
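Something along these lines is usually enough to pull reasonably clean text from feeds. It's just a sketch: the feed URLs are placeholders, feedparser is a separate third-party install, and the tag-stripping regex is deliberately crude:

```python
import re
import feedparser  # third-party: pip install feedparser

# Placeholder feeds -- substitute the blogs or news sources you care about.
FEEDS = [
    "http://example.com/blog/feed.rss",
    "http://example.org/news/rss.xml",
]

def strip_tags(html):
    """Drop HTML tags so only the readable text remains."""
    return re.sub(r"<[^>]+>", " ", html)

for url in FEEDS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        title = entry.get("title", "")
        body = entry.get("summary", "")  # feedparser maps <description> here as well
        print(title)
        print(strip_tags(body))
```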
If you do this commercially, the LDC might be a viable alternative.