I am looking to do some text analysis in a program I am writing. I am looking for alternate sources of text in its raw form similar to what is provided in the Wikipedia dumps (download.wikimedia.com).
I'd rather not have to go through the trouble of crawling websites, trying to parse the html , extracting text etc..
What sort of text are you looking for?
There are many free e-books (fiction and non-fiction) in .txt format available at Project Gutenberg.
They also have large DVD images full of books available for download.
NLTK provides a simple Python API to access many text corpora, including Gutenberg, Reuters, Shakespeare, and others.
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With