How can I access the raw documents from the Brown corpus?

For all other NLTK corpora, calling corpus.raw() yields the original text from the files. For example:

>>> from nltk.corpus import webtext
>>> webtext.raw()[:10]
'Cookie Man'

However, when calling brown.raw() you get tagged text.

>>> from nltk.corpus import brown
>>> brown.raw()[:10]
'\n\n\tThe/at '

I've read all the documentation I can find but can't seem to find an obvious explanation or way to get the un-tagged version. Is there a reason this corpus is tagged and the others aren't?

Cassian Corey asked Nov 15 '17 06:11


2 Answers

TL;DR

import nltk
nltk.download('brown')
nltk.download('nonbreaking_prefixes')
nltk.download('perluniprops')

from nltk.corpus import brown
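# NLTK 3.3+ no longer ships nltk.tokenize.moses; there, use instead:
#     from sacremoses import MosesDetokenizer
from nltk.tokenize.moses import MosesDetokenizer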

mdetok = MosesDetokenizer()

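def to_natural(sent):
    # Convert Brown-style quotes (`` and '') to " and ` to ', then detokenize.
    munged = ' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'")
    return mdetok.detokenize(munged.split(), return_str=True)

brown_natural = [to_natural(sent) for sent in brown.sents()]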

for sent in brown_natural:
    print(sent)

In Long

It's because the "raw" version of the Brown corpus is tokenized and tagged, i.e. the corpus comes tagged and that's the original form of the corpus =)

You can look at the individual files in your nltk_data directory:

$ head -n10 nltk_data/corpora/brown/ca01


    The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.


    The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.


    The/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl Court/nn-tl Judge/nn-tl Durwood/np Pye/np to/to investigate/vb reports/nns of/in possible/jj ``/`` irregularities/nns ''/'' in/in the/at hard-fought/jj primary/nn which/wdt was/bedz won/vbn by/in Mayor-nominate/nn-tl Ivan/np Allen/np Jr./np ./.
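To make the word/tag layout concrete, here is a minimal sketch that strips the tags by hand from text in this format. It assumes one sentence per line and whitespace-separated word/tag tokens, as in the excerpt above; strip_brown_tags is a hypothetical helper name, not part of NLTK, and the sample string is a shortened version of the first sentence.

```python
def strip_brown_tags(raw_text):
    """Return a list of sentences (lists of words) from Brown-style 'word/tag' text.

    Assumes one sentence per line and whitespace-separated word/tag tokens.
    """
    sentences = []
    for line in raw_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip the blank lines between paragraphs
        # Split on the LAST slash only, so tokens such as './.' and '``/``'
        # keep their word part intact.
        words = [tok.rsplit('/', 1)[0] for tok in tokens]
        sentences.append(words)
    return sentences


sample = "\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd ./.\n"
print(strip_brown_tags(sample))
# [['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', '.']]
```

This is only meant to show what the tags look like once removed; in practice the corpus reader already does this work for you.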

If you want the words from the corpus, you can use brown.words(), e.g.

>>> from nltk.corpus import brown

>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

>>> ' '.join(brown.words()[:30])
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"

If you want to get words from a specific file:

>>> brown.fileids()[:10] # The first 10 fileids from brown.
[u'ca01', u'ca02', u'ca03', u'ca04', u'ca05', u'ca06', u'ca07', u'ca08', u'ca09', u'ca10']

>>> ' '.join(brown.words('ca01')[:30]) # First 30 words from the 'ca01' file.
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"

And the sentences from a specific file:

>>> brown.sents('ca01')
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]

To print out the individual sentences:

>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
...     print(' '.join(sent))
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .

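Trying to detokenize the tokenized corpus is rather messy and may or may not work well, but you can try the MosesDetokenizer. (Note that NLTK 3.3 and later no longer ship nltk.tokenize.moses; a MosesDetokenizer with the same detokenize() call is available from the separate sacremoses package.)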

First download the data needed by the MosesDetokenizer:

>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/ltan/nltk_data...
[nltk_data]   Unzipping misc/perluniprops.zip.
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/ltan/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
True

Then initialize the MosesDetokenizer:

>>> from nltk.tokenize.moses import MosesDetokenizer
>>> mdetok = MosesDetokenizer()

And use the MosesDetokenizer.detokenize():

>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
...     # Join the words in sentences and convert the `` -> "
...     # also convert '' -> " and ` -> '
...     munged_sentence = ' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'")
...     print(mdetok.detokenize(munged_sentence.split(), return_str=True)) # MosesDetokenizer expects a list of strings as input.
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible "irregularities" in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr..
"Only a relative handful of such reports was received", the jury said, "considering the widespread interest in the election, the number of voters and the size of this city".
The jury said it did find that many of Georgia's registration and election laws "are outmoded or inadequate and often ambiguous".

To convert every sentence in brown into natural reading text:

from nltk.tokenize.moses import MosesDetokenizer
mdetok = MosesDetokenizer()
brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True)  for sent in brown.sents()]

[out]:

>>> for sent in brown_natural:
...     print(sent)
...     break
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.

alvas answered Oct 07 '22 08:10


The tagged text is the raw document, the actual content of the Brown corpus files. The raw() method shows you exactly what is stored in the files; it only returns clean text for "plain text" corpora, not for "all other corpora" as you assume. Try nltk.corpus.treebank.raw('wsj_0001.mrg') or nltk.corpus.conll2000.raw("train.txt"), for example, and you'll see trees and "IOB format" text respectively.

Now if your goal is to reconstitute readable text, joining on spaces is usually good enough for me:

for sent in brown.sents():
    print(" ".join(sent))

You'll get output like this:

`` Only a relative handful of such reports was received '' , the jury said , `` considering
the widespread interest in the election , the number of voters and the size of this 
city '' .

If you don't like the way this looks, see the answer by alvas for a more ambitious reconstruction.
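If you want something in between, a few regular expressions on top of the space join already remove most of the visual noise. This is a rough sketch, not a real detokenizer: tidy is a hypothetical helper name, the token list is a shortened made-up example, and the rules only cover the Brown-style quotes and common punctuation.

```python
import re


def tidy(sentence_tokens):
    """Join tokens on spaces, then apply a few cosmetic fixes."""
    text = " ".join(sentence_tokens)
    text = re.sub(r"``\s*", '"', text)             # opening quote hugs the next word
    text = re.sub(r"\s*''", '"', text)             # closing quote hugs the previous word
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)   # no space before punctuation
    return text


tokens = ["``", "Only", "a", "handful", "''", ",", "the", "jury", "said", "."]
print(tidy(tokens))
# "Only a handful", the jury said.
```

With the corpus loaded you could call tidy(sent) for each sent in brown.sents(). It will still get things like brackets and abbreviations wrong.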

alexis answered Oct 07 '22 08:10