
How Can I Access the Brown Corpus in Java (aka outside of NLTK)

I'm trying to write a Java program that works with natural-language parts of speech. I've been searching on Google and haven't found the entire Brown Corpus (or another corpus of tagged words). I keep finding NLTK information, which I'm not interested in. I want to load the data into a Java program, count the occurrences of each word, and work out the percentage likelihood of each word being a given part of speech.

I do not want to use a Java library like the Stanford one; I want to play with the corpus data myself.

Nate Cook3 asked Dec 25 '22 18:12


1 Answer

Here's a link to the download page for the Brown Corpus: http://www.nltk.org/nltk_data/

All the downloads there are zip files. The data format is described on the Brown Corpus Wikipedia page. From there, things should be fairly straightforward.
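To give a rough idea of the counting step, here is a minimal sketch. It assumes each line of the unzipped corpus files holds whitespace-separated `word/tag` tokens (for example `The/at`), which is how I understand the NLTK copy to be laid out; check a file yourself before relying on that. The class and method names (`BrownTagCounter`, `addLine`, `tagPercentages`) are made up for this example, and folding words to lower case is my own choice, not something the corpus requires.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts how often each word appears with each tag.
final class BrownTagCounter {

    // word -> (tag -> number of occurrences)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Parses one line of whitespace-separated word/tag tokens (assumed format). */
    void addLine(String line) {
        for (String token : line.trim().split("\\s+")) {
            // Split on the LAST slash, since the word part may itself contain '/'.
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                continue; // empty or malformed token, skip it
            }
            String word = token.substring(0, slash).toLowerCase(Locale.ROOT);
            String tag = token.substring(slash + 1);
            counts.computeIfAbsent(word, w -> new TreeMap<>())
                  .merge(tag, 1, Integer::sum);
        }
    }

    /** Returns tag -> percentage of occurrences for a word (empty map if unseen). */
    Map<String, Double> tagPercentages(String word) {
        Map<String, Integer> tags = counts.getOrDefault(
                word.toLowerCase(Locale.ROOT), Collections.<String, Integer>emptyMap());
        int total = 0;
        for (int c : tags.values()) {
            total += c;
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tags.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny hand-written sample in the assumed format, not real corpus text.
        String sample = "The/at run/nn was/bedz long/jj ./.\nThey/ppss run/vb daily/rb ./.";
        BrownTagCounter counter = new BrownTagCounter();
        for (String line : sample.split("\n")) {
            counter.addLine(line);
        }
        System.out.println(counter.tagPercentages("run"));
    }
}
```

For the real data you would unzip the download, read each file line by line (for instance with `java.nio.file.Files.newBufferedReader`), and pass every line to `addLine`. Skip any non-corpus files in the archive, such as a README, since they won't be in the tagged format.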

EDIT: If you want the original source texts, I think some corpora out there include them. Usually, though, the point of a corpus is to let someone else do the sampling. Also, note this from the Wikipedia entry: "Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words." So the data for the Brown Corpus is essentially randomized. Even if you had the original texts, you might not be able to guess where they sampled.

markspace answered Dec 27 '22 06:12