Where can I find a dump of raw text on the web?

Question

I want to do some text analysis in a program I am writing, and I am looking for alternative sources of raw text, similar to what the Wikipedia dumps (download.wikimedia.com) provide.

I'd rather not have to go through the trouble of crawling websites, parsing the HTML, extracting the text, etc.

Blorgbeard · Accepted Answer

What sort of text are you looking for?

There are many free e-books (fiction and non-fiction) in .txt format available at Project Gutenberg.

They also have large DVD images full of books available for download.
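Project Gutenberg .txt files usually wrap the book itself in a license header and footer, which you will probably want to drop before analysis. Here is a minimal sketch, assuming the file uses marker lines beginning with "*** START OF" and "*** END OF" (older files may differ). The function name strip_gutenberg_boilerplate and the sample text are made up for illustration.

```python
def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Assumption: the file uses these marker lines. If a marker is missing,
    that end of the text is left untouched.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("*** START OF"):
            start = i + 1  # body begins on the line after the start marker
        elif stripped.startswith("*** END OF"):
            end = i        # body stops just before the end marker
            break
    return "\n".join(lines[start:end]).strip()


# Hypothetical sample standing in for a downloaded .txt file.
sample = """Header and license text
*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
Chapter 1
It was a dark night.
*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
More license text"""

body = strip_gutenberg_boilerplate(sample)
print(body)
```

In practice you would read the downloaded file with open(path, encoding="utf-8") and pass its contents to the function.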

Cerin · Answer

NLTK provides a simple Python API to access many text corpora, including Gutenberg, Reuters, Shakespeare, and others.

>>> # Requires the corpus data; run nltk.download('brown') once beforehand.
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
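Since the corpus comes back as a sequence of tokens, simple analysis needs very little code. As a rough sketch of one possible first step (the helper top_words and the short token list are illustrative, not part of NLTK), here is a word-frequency count:

```python
from collections import Counter


def top_words(tokens, n=3):
    """Return the n most frequent alphabetic tokens, lower-cased."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    return counts.most_common(n)


# Small hand-made token list; a corpus word list could be passed instead.
tokens = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said',
          'the', 'jury', ',', 'the']

result = top_words(tokens, 2)
print(result)
```

Passing brown.words() in place of the hand-made list should work the same way, since it is also a sequence of strings.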

Tags: text, parsing, nlp, wikipedia

Jason
