 

Importing external treebank-style BLLIP corpus using NLTK

I have downloaded the BLLIP corpus and would like to import it to NLTK. One way that I have found for doing this is described in the answer of the question How to read corpus of parsed sentences using NLTK in python?. In that answer they are doing it for one data file. I want to do it for a collection of them.

The BLLIP corpus comes as a collection of a few million files, each containing a couple of parsed sentences or so. The main folder that contains the data is named bllip_87_89_wsj and it contains 3 subfolders, 1987, 1988 and 1989 (one for each year). Subfolder 1987 contains sub-subfolders, each holding a number of files of parsed sentences. A sub-subfolder is named something like w7_001 (for folder 1987) and the file names are w7_001.000, w7_001.001 and so on.

With all this at hand, my task is the following: Read all files sequentially using NLTK parsers. Then, convert the corpus to a list of lists, where each sublist is a sentence.

The second part is easy; it's done with the command corpus_name.sents(). It is the first part of the task that I don't know how to approach.

All suggestions are welcome. I would also especially welcome suggestions that propose alternative, more efficient, approaches to the one I have in mind.

UPDATE:

The parsed sentences of the BLLIP corpus are of the following form:

(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))

In a number of sentences there is a syntactic category of the form (-NONE- *-0), so when I read the corpus *-0 is considered a word. Is there a way to ignore the syntactic category -NONE-? For example, if I had the sentence

(S (NP-SBJ (-NONE- *-0))
  (VP (TO to)
   (VP (VB sell)
    (NP (NP (PRP$ its) (NN TV) (NN station))
     (NN advertising)
     (NN representation)
     (NN operation)
     (CC and)
     (NN program)
     (NN production)
     (NN unit)))))

I would like it to become:

to sell its TV station advertising representation operation and program production unit

and NOT

*-0 to sell its TV station advertising representation operation and program production unit

which is what I currently get.
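For reference, a minimal sketch of the kind of filtering I have in mind (assuming the parses load cleanly with nltk's Tree.fromstring): read each parse into an nltk Tree and keep only the leaves whose tag is not -NONE-.

```python
from nltk.tree import Tree

# A shortened BLLIP-style parse containing a trace (-NONE- *-0).
parse = "(S (NP-SBJ (-NONE- *-0)) (VP (TO to) (VP (VB sell) (NP (NN unit)))))"

tree = Tree.fromstring(parse)

# tree.pos() yields (word, tag) pairs; drop the leaves tagged -NONE-.
words = [word for word, tag in tree.pos() if tag != '-NONE-']
print(' '.join(words))  # to sell unit
```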

Asked by Orest Xherija on Mar 06 '17




1 Answer

The question you link to is just a little misleading. Indeed that code sample reads just one file, but the nltk's corpus reader interface is designed for reading large collections of files. The obligatory arguments for the reader constructor are the path to the base folder of the corpus and a regexp (an ordinary one, not a "glob") that matches all file names that should be read in. So just adapt the answer to the question by adding the appropriate regexp. (Also add format options if your corpus does not match the BracketParseCorpusReader defaults.) For example:

from nltk.corpus.reader import BracketParseCorpusReader
reader = BracketParseCorpusReader('path/to/bllip_87_89_wsj', r'.*/w\d_.*')

This will match any file whose name begins with w<digit>_, in any subfolder. If you happen to have files that match this pattern but must be excluded (example: w7_001.001-old), you can sharpen the above regexp.
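As a sketch of such a sharpened regexp (using a tiny throwaway directory in place of the real corpus root, since paths here are hypothetical), anchoring the numeric suffix with `$` excludes names like w7_001.001-old:

```python
import os
import tempfile
from nltk.corpus.reader import BracketParseCorpusReader

# Throwaway layout mimicking 1987/w7_001/... under the corpus root.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, '1987', 'w7_001'))
for name in ('w7_001.000', 'w7_001.001-old'):
    open(os.path.join(root, '1987', 'w7_001', name), 'w').close()

# The anchored suffix \.\d+$ rejects w7_001.001-old.
reader = BracketParseCorpusReader(root, r'.*/w\d_\d+\.\d+$')
print(reader.fileids())  # ['1987/w7_001/w7_001.000']
```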

You can use this corpus reader just like you'd use the parsed corpora distributed with the nltk. Note that since you have millions of files, you should avoid constructing a list of the sentences (or even of the filenames). The reader's methods return "views", special objects that allow you to iterate and index into the results without ever loading the entire list of results into memory.
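For instance (again with a tiny throwaway corpus standing in for the real bllip_87_89_wsj root), you can iterate over the view directly instead of materializing it:

```python
import os
import tempfile
from nltk.corpus.reader import BracketParseCorpusReader

# Tiny throwaway corpus standing in for the real corpus root.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'w7_001'))
with open(os.path.join(root, 'w7_001', 'w7_001.000'), 'w') as f:
    f.write('(S (NP (DT the) (NN dog)) (VP (VBD barked)))\n')

reader = BracketParseCorpusReader(root, r'.*/w\d_.*')

# sents() returns a lazy view; iterating never builds the full list.
for sent in reader.sents():
    print(sent)  # ['the', 'dog', 'barked']
```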

Answered by alexis on Oct 25 '22