
NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of Usenet movie reviews, which hadn't occurred to me and is very good. For this particular program, technical Usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, chat transcripts, or anything else that has been useful to others would be very helpful. A partial or downloadable research corpus that isn't too marked-up, some heuristic for finding an appropriate subset of Wikipedia articles, or any other idea would also be much appreciated.

(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)

UPDATE: User S0rin points out that Wikipedia asks people not to crawl it and provides this export tool instead. Project Gutenberg has a policy specified here. Bottom line: try not to crawl, but if you need to, "Configure your robot to wait at least 2 seconds between requests."
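For anyone writing a similarly slow download script, here is a minimal sketch of the "wait at least 2 seconds between requests" rule. The names `read_url` and `fetch_politely` are made up for illustration, and the `fetch` and `sleep` parameters exist only so the pacing logic can be tried without touching a real server.

```python
import time
import urllib.request
from typing import Callable, Iterable, Iterator, Tuple


def read_url(url: str) -> bytes:
    """Default fetcher: download one URL with the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_politely(
    urls: Iterable[str],
    delay_seconds: float = 2.0,
    fetch: Callable[[str], bytes] = read_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, body) pairs, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        yield url, fetch(url)
```

Because it is a generator, nothing is requested until you iterate over it, and a site's own policy (robots.txt, a longer delay) should still take precedence over the 2-second default assumed here.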

UPDATE 2: The Wikipedia dumps are the way to go, thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ , and a Spanish dump about half the size. They take some work to clean up, but it is well worth it, and they contain a lot of useful data in the links.
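To give an idea of what that cleanup involves, here is a rough sketch that strips the most common wiki markup with regular expressions and pulls out the link targets separately. This is not the script I used; the function names are hypothetical, and regexes only approximate wikitext (tables, nested links in image captions, and many other constructs are not handled).

```python
import re

# Patterns for the most common markup; an approximation, not a full parser.
REF = re.compile(r"<ref[^>]*?/>|<ref[^>]*>.*?</ref>", re.DOTALL)
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")          # innermost {{...}} only
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]")
TAG = re.compile(r"<[^>]+>")
QUOTES = re.compile(r"'{2,}")                     # ''italic'' and '''bold'''
HEADING = re.compile(r"^=+\s*(.*?)\s*=+\s*$", re.MULTILINE)
LINK_TARGET = re.compile(r"\[\[([^\]|#]+)")


def link_targets(wikitext: str) -> list:
    """Return the article titles that the text links to."""
    return [match.group(1).strip() for match in LINK_TARGET.finditer(wikitext)]


def strip_wikitext(wikitext: str) -> str:
    """Reduce wikitext to roughly plain prose."""
    text = REF.sub("", wikitext)
    # Remove templates innermost-first until none are left, so nesting works.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub("", text)
    text = LINK.sub(r"\1", text)      # keep the visible label of each link
    text = TAG.sub("", text)
    text = QUOTES.sub("", text)
    text = HEADING.sub(r"\1", text)
    return re.sub(r"[ \t]+", " ", text).strip()
```

Calling `link_targets` on the raw text before stripping is what keeps the link data available for later use.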


unmounted asked Sep 26 '08 02:09



1 Answer

  • Use the Wikipedia dumps
    • needs lots of cleanup
  • See if anything in nltk-data helps you
    • the corpora are usually quite small
  • the Wacky people have some free corpora
    • tagged
    • you can spider your own corpus using their toolkit
  • Europarl is free and the basis of pretty much every academic MT system
    • spoken language, translated
  • The Reuters Corpora are free of charge, but only available on CD
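For the Wikipedia dumps in particular, the files are far too large to load at once, so it helps to stream them. The sketch below assumes the usual MediaWiki export layout (`page` elements containing a `title` and a revision `text`); the `SAMPLE` document, its namespace version, and the function names are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a real dump file (hypothetical content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
<page><title>Madrid</title><revision><text>Capital of [[Spain]].</text></revision></page>
<page><title>Oslo</title><revision><text>Capital of [[Norway]].</text></revision></page>
</mediawiki>"""


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, wikitext) pairs one page at a time."""
    title, text = None, None
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's children


if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, "->", wikitext)
```

With a real dump you would pass an open file (or a `bz2.open(...)` handle for the compressed download) instead of the `BytesIO` wrapper.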

You can always build your own corpus by crawling, but be warned: HTML pages often need heavy cleanup, so it is easier to restrict yourself to RSS feeds.
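As a rough illustration of why feeds are easier, here is a sketch that pulls the title and plain-text description out of each item. It assumes a plain RSS 2.0 feed (Atom uses different element names), and `html_to_text` and `rss_item_texts` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment: str) -> str:
    """Drop tags from an HTML fragment and normalize whitespace."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def rss_item_texts(rss_xml: str) -> list:
    """Return (title, description_text) for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        texts.append((title.strip(), html_to_text(description)))
    return texts
```

Many feeds carry only a summary in the description, so how much text you actually get varies from feed to feed.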

If you do this commercially, the LDC might be a viable alternative.

Torsten Marek answered Sep 26 '22 00:09
