
Using British National Corpus in NLTK

I am new to NLTK (http://www.nltk.org/), and to Python for that matter. I wish to use the NLTK Python library with the BNC as the corpus. I do not believe this corpus is distributed through the NLTK Data download. Is there a way to import the BNC so that NLTK can use it? If so, how? I did find a class called BNCCorpusReader but have no idea how to use it. I was also able to download the corpus from the BNC site (http://ota.ox.ac.uk/desc/2554).

http://www.nltk.org/api/nltk.corpus.reader.html?highlight=bnc#nltk.corpus.reader.BNCCorpusReader.word

Update

I have tried entrophy's suggestion, but get the following error:

raise IOError('No such file or directory: %r' % _path)
OSError: No such file or directory: 'C:\\Users\\jason\\Documents\\NetBeansProjects\\DemoCollocations\\src\\Corpora\\bnc\\A\\A0\\A00.xml'

My code to read in the corpus:

bnc_reader = BNCCorpusReader(root="Corpora/bnc", fileids=r'[A-K]/\w*/\w*\.xml')

And my corpus is located in: C:\Users\jason\Documents\NetBeansProjects\DemoCollocations\src\Corpora\bnc\

jason asked Apr 19 '17 21:04


1 Answer

For examples of using NLTK for collocation extraction, take a look at the NLTK how-to guide on collocations.

As for the BNC corpus reader, all the information you need is in the documentation.

from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Instantiate the reader; root must be the directory that directly contains
# the A, B, ... folders matched by the fileids pattern
bnc_reader = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml')

# Say you want to extract all bigram collocations and score them by their
# raw frequency; this is what you would do.
# Again, see the NLTK guide on collocations for more examples.

list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml']
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids))
scored = finder.score_ngrams(bigram_measures.raw_freq)

print(scored)

The output of that will look something like this:

[(('of', 'the'), 0.004902261167963723), (('in', 'the'), 0.003554139346773699), 
 (('.', 'The'), 0.0034315828175746064), (('Gift', 'Aid'), 0.0019609044671854894), 
 ((',', 'and'), 0.0018996262025859428), (('for', 'the'), 0.0018383479379863962), ... ]
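Note that the pairs above are already in descending order of score: NLTK documents score_ngrams as returning (ngram, score) pairs ordered from highest to lowest score. If you want only the top few bigrams, or want to skip pairs that involve punctuation, a small helper can do that. The sketch below is plain Python and does not need NLTK; top_bigrams is a hypothetical helper name rather than part of the library, and the sample data is copied from the output above.

```python
# Sample (bigram, score) pairs copied from the output shown above.
scored = [
    (('of', 'the'), 0.004902261167963723),
    (('in', 'the'), 0.003554139346773699),
    (('.', 'The'), 0.0034315828175746064),
    (('Gift', 'Aid'), 0.0019609044671854894),
    ((',', 'and'), 0.0018996262025859428),
    (('for', 'the'), 0.0018383479379863962),
]


def top_bigrams(scored_pairs, n=3, skip_punctuation=True):
    """Return the n highest-scoring (bigram, score) pairs.

    Hypothetical helper, not an NLTK function. When skip_punctuation is
    True, bigrams containing a token with no letter or digit are dropped.
    """
    def is_word_pair(bigram):
        return all(any(ch.isalnum() for ch in token) for token in bigram)

    kept = [
        (bigram, score)
        for bigram, score in scored_pairs
        if not skip_punctuation or is_word_pair(bigram)
    ]
    # Sort explicitly by score so the helper also works on unsorted input.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]


for bigram, score in top_bigrams(scored):
    print(bigram, score)
```

With real data you would pass the list returned by finder.score_ngrams(...) instead of the sample list.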

And if you wanted just the bigrams, sorted alphabetically rather than by score, you could try something like this:

sorted_bigrams = sorted(bigram for bigram, score in scored)

print(sorted_bigrams)

Which results in:

[('!', 'If'), ('!', 'Of'), ('!', 'Once'), ('!', 'Particularly'), ('!', 'Raising'), 
 ('!', 'YOU'), ('!', '‘'), ('&', 'Ealing'), ('&', 'Public'), ('&', 'Surrey'), 
 ('&', 'TRAINING'), ("'", 'SPONSORED'), ("'S", 'HOME'), ("'S", 'SERVICE'), ... ]
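Regarding the "No such file or directory" error in the question's update: the reader joins root and each fileid, so that error means no file exists at that combined path. One possible cause (an assumption, since the question does not show how the download was unpacked) is that root points at a folder above the one that directly contains the A, B, ... directories. Before building the reader, it can help to list which files under root actually match the pattern. The sketch below uses only the standard library; matching_fileids is a hypothetical helper, and it only approximates how NLTK resolves the fileids pattern.

```python
import re
from pathlib import Path


def matching_fileids(root, pattern=r'[A-K]/\w*/\w*\.xml'):
    """List files under root whose relative path fully matches pattern.

    Hypothetical helper, not part of NLTK. Paths are returned relative to
    root with forward slashes, the same shape as the fileids in the answer.
    """
    root_path = Path(root).resolve()
    if not root_path.is_dir():
        raise FileNotFoundError(f"Corpus root does not exist: {root_path}")

    regex = re.compile(pattern)
    fileids = []
    for path in sorted(root_path.rglob('*.xml')):
        fileid = path.relative_to(root_path).as_posix()
        if regex.fullmatch(fileid):
            fileids.append(fileid)
    return fileids
```

For example, matching_fileids("Corpora/bnc") should include 'A/A0/A00.xml' if root is right. An empty list suggests the texts live in a subdirectory, in which case root should point there instead.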
entrophy answered Dec 06 '22 23:12