 

Importing external treebank-style BLLIP corpus using NLTK

I have downloaded the BLLIP corpus and would like to import it to NLTK. One way that I have found for doing this is described in the answer of the question How to read corpus of parsed sentences using NLTK in python?. In that answer they are doing it for one data file. I want to do it for a collection of them.

The BLLIP corpus comes as a collection of a few million files, each containing a couple of parsed sentences or so. The main folder that contains the data is named bllip_87_89_wsj and it contains 3 subfolders, 1987, 1988 and 1989 (one for each year). Subfolder 1987 contains sub-subfolders, each holding a number of files of parsed sentences. A sub-subfolder is named something like w7_001 (for folder 1987) and the file names are w7_001.000, w7_001.001 and so on.

With all this at hand, my task is the following: Read all files sequentially using NLTK parsers. Then, convert the corpus to a list of lists, where each sublist is a sentence.

The second part is easy; it's done with the command corpus_name.sents(). It is the first part of the task that I don't know how to approach.

All suggestions are welcome. I would also especially welcome suggestions that propose alternative, more efficient, approaches to the one I have in mind.

UPDATE:

The parsed sentences of the BLLIP corpus are of the following form:

(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))

In a number of sentences there is a syntactic category of the form (-NONE- *-0), so when I read the corpus *-0 is considered a word. Is there a way to ignore the syntactic category -NONE-? For example, if I had the sentence

(S (NP-SBJ (-NONE- *-0))
  (VP (TO to)
   (VP (VB sell)
    (NP (NP (PRP$ its) (NN TV) (NN station))
     (NN advertising)
     (NN representation)
     (NN operation)
     (CC and)
     (NN program)
     (NN production)
     (NN unit)))))

I would like it to become:

to sell its TV station advertising representation operation and program production unit

and NOT

*-0 to sell its TV station advertising representation operation and program production unit

which is what I currently get.
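For reference, a minimal sketch of the kind of filtering I have in mind (assuming the parses load cleanly with nltk's Tree.fromstring): read each parse into an nltk Tree and keep only the leaves whose tag is not -NONE-.

```python
from nltk.tree import Tree

# A shortened BLLIP-style parse containing a trace (-NONE- *-0).
parse = "(S (NP-SBJ (-NONE- *-0)) (VP (TO to) (VP (VB sell) (NP (NN unit)))))"

tree = Tree.fromstring(parse)

# tree.pos() yields (word, tag) pairs; drop the leaves tagged -NONE-.
words = [word for word, tag in tree.pos() if tag != '-NONE-']
print(' '.join(words))  # to sell unit
```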

Asked by Orest Xherija on Mar 06 '17




1 Answer

The question you link to is just a little misleading. Indeed that code sample reads just one file, but the nltk's corpus reader interface is designed for reading large collections of files. The obligatory arguments for the reader constructor are the path to the base folder of the corpus and a regexp (an ordinary one, not a "glob") that matches all file names that should be read in. So just adapt the answer to the question by adding the appropriate regexp. (Also add format options if your corpus does not match the BracketParseCorpusReader defaults.) For example:

from nltk.corpus.reader import BracketParseCorpusReader
reader = BracketParseCorpusReader('path/to/bllip_87_89_wsj', r'.*/w\d_.*')

This will match any file whose name begins with w<digit>_, in any subfolder. If you happen to have files that match this pattern but must be excluded (example: w7_001.001-old), you can sharpen the above regexp.
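As a sketch of such a sharpened regexp (using a tiny throwaway directory in place of the real corpus root, since paths here are hypothetical), anchoring the numeric suffix with `$` excludes names like w7_001.001-old:

```python
import os
import tempfile
from nltk.corpus.reader import BracketParseCorpusReader

# Throwaway layout mimicking 1987/w7_001/... under the corpus root.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, '1987', 'w7_001'))
for name in ('w7_001.000', 'w7_001.001-old'):
    open(os.path.join(root, '1987', 'w7_001', name), 'w').close()

# The anchored suffix \.\d+$ rejects w7_001.001-old.
reader = BracketParseCorpusReader(root, r'.*/w\d_\d+\.\d+$')
print(reader.fileids())  # ['1987/w7_001/w7_001.000']
```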

You can use this corpus reader just like you'd use the parsed corpora distributed with the nltk. Note that since you have millions of files, you should avoid constructing a list of the sentences (or even of the filenames). The reader's methods return "views", special objects that allow you to iterate and index into the results without ever loading the entire list of results into memory.
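For instance (again with a tiny throwaway corpus standing in for the real bllip_87_89_wsj root), you can iterate over the view directly instead of materializing it:

```python
import os
import tempfile
from nltk.corpus.reader import BracketParseCorpusReader

# Tiny throwaway corpus standing in for the real corpus root.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'w7_001'))
with open(os.path.join(root, 'w7_001', 'w7_001.000'), 'w') as f:
    f.write('(S (NP (DT the) (NN dog)) (VP (VBD barked)))\n')

reader = BracketParseCorpusReader(root, r'.*/w\d_.*')

# sents() returns a lazy view; iterating never builds the full list.
for sent in reader.sents():
    print(sent)  # ['the', 'dog', 'barked']
```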

Answered by alexis on Oct 25 '22