I'm trying to write a Java program that makes use of natural-language parts of speech. I've been searching on Google and haven't found the entire Brown Corpus (or another corpus of tagged words). I keep finding NLTK information, which I'm not interested in. I want to be able to load the data into a Java program, count the occurrences of each word, and work out the percentage likelihood of each part of speech for that word.
I do not want to use a Java library like the Stanford one; I want to play with the corpus data myself.
Here's a link to the download page for the Brown Corpus: http://www.nltk.org/nltk_data/
All the files are zip files. The data format is described on the Brown Corpus Wikipedia page. I don't know what else to say; from there things should be obvious.
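For what it's worth, here is a minimal sketch of the counting part in plain Java, with no NLP library. It rests on a few assumptions that you should check against your download: that the unzipped files are plain text with whitespace-separated `word/tag` tokens (for example `The/at jury/nn said/vbd`), and that the corpus files have names like `ca01` (the name filter is there to skip README-style files). The class name `BrownTagCounter` and its methods are made up for this example. Words and tags are lower-cased, so "The" and "the" are counted together; drop that if you want case to matter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class for this example; not part of any library.
final class BrownTagCounter {

    /** Adds every word/tag token on one line to counts (word -> tag -> occurrences). */
    static void addLine(Map<String, Map<String, Integer>> counts, String line) {
        for (String token : line.trim().split("\\s+")) {
            // Split on the LAST slash so tokens such as "1/2/cd" keep "1/2" as the word.
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                continue; // blank line or a token that is not word/tag
            }
            String word = token.substring(0, slash).toLowerCase();
            String tag = token.substring(slash + 1).toLowerCase();
            counts.computeIfAbsent(word, k -> new HashMap<>()).merge(tag, 1, Integer::sum);
        }
    }

    /** Returns the fraction of a word's occurrences that carry the given tag (0.0 if unseen). */
    static double probability(Map<String, Map<String, Integer>> counts, String word, String tag) {
        Map<String, Integer> tags = counts.get(word.toLowerCase());
        if (tags == null) {
            return 0.0;
        }
        int total = 0;
        for (int n : tags.values()) {
            total += n;
        }
        return tags.getOrDefault(tag.toLowerCase(), 0) / (double) total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java BrownTagCounter <unzipped-corpus-dir> <word>");
            return;
        }
        List<Path> files;
        try (Stream<Path> paths = Files.list(Paths.get(args[0]))) {
            files = paths
                    .filter(Files::isRegularFile)
                    // Assumed file naming (ca01, cb02, ...); adjust if yours differs.
                    .filter(p -> p.getFileName().toString().matches("c[a-z]\\d\\d"))
                    .collect(Collectors.toList());
        }
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (Path file : files) {
            // ISO-8859-1 maps every byte to a character, so decoding cannot fail.
            for (String line : Files.readAllLines(file, StandardCharsets.ISO_8859_1)) {
                addLine(counts, line);
            }
        }
        String word = args[1];
        Map<String, Integer> tags = counts.getOrDefault(word.toLowerCase(), new HashMap<>());
        for (String tag : tags.keySet()) {
            System.out.printf("%s/%s: %d occurrences, %.1f%%%n",
                    word, tag, tags.get(tag), 100.0 * probability(counts, word, tag));
        }
    }
}
```

Run it with the unzipped corpus directory and a word, e.g. `java BrownTagCounter brown run`, to list each tag seen for that word along with its count and percentage.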
EDIT: If you want the original source texts, I think some corpora make them available. Usually, though, the point is to let someone else do the sampling. Also, note this from the Wikipedia entry: "Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words." So the samples in the Brown Corpus are essentially random excerpts. Even if you had the original texts, you might not be able to work out where each sample was taken from.