
Where can I find a dump of raw text on the web?

I am writing a program that does some text analysis, and I am looking for alternative sources of raw text, similar to what the Wikipedia dumps provide (download.wikimedia.com).

I'd rather not go through the trouble of crawling websites, parsing the HTML, extracting the text, and so on.

Jason asked Aug 02 '10 13:08


2 Answers

What sort of text are you looking for?

There are many free e-books (fiction and non-fiction) in .txt format available at Project Gutenberg.

They also have large DVD images full of books available for download.
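
One thing to watch for if you go this route: the plain-text files usually wrap the book in a license header and footer, delimited by lines such as `*** START OF THE PROJECT GUTENBERG EBOOK ... ***`. Here is a minimal sketch of stripping them before analysis. The function name is hypothetical, and the marker wording is an assumption that varies between files, so the patterns match loosely and fall back to the whole text when no marker is found.

```python
import re

# Assumed marker wording; some files say "THIS" instead of "THE".
_START = re.compile(
    r"^\*\*\*\s*START OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
_END = re.compile(
    r"^\*\*\*\s*END OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = _START.search(text)
    begin = start.end() if start else 0
    end = _END.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()
```

You would read a downloaded .txt file into a string and pass it through this function before tokenizing.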

Blorgbeard answered Sep 17 '22 12:09


NLTK provides a simple Python API to access many text corpora, including Gutenberg, Reuters, Shakespeare, and others.

>>> # The corpus data must be fetched once first, e.g. with nltk.download('brown')
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
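
Since the corpus readers hand back plain sequences of word strings, they plug straight into ordinary Python. As a rough sketch of a first text-analysis step, the following counts the most frequent words. Both function names are hypothetical, and the second one assumes nltk is installed and the Brown corpus data has already been downloaded.

```python
from collections import Counter


def top_words(words, n=10):
    """Return the n most common alphabetic tokens, compared case-insensitively."""
    counts = Counter(w.lower() for w in words if w.isalpha())
    return counts.most_common(n)


def top_words_in_brown(n=10):
    """Hypothetical helper: apply top_words to NLTK's Brown corpus.

    Assumes nltk is installed and nltk.download('brown') has been run once.
    """
    # Imported lazily so that top_words itself works without nltk.
    from nltk.corpus import brown
    return top_words(brown.words(), n)
```

Any other word list works the same way, so `top_words` can be reused on text from Gutenberg or elsewhere.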

Cerin answered Sep 17 '22 12:09