How can I use the complete Penn Treebank dataset in Python/NLTK?

Tags:

python

nlp

nltk

corpus

penn-treebank

I'm learning to use the NLTK package in Python. In particular, I need to use the Penn Treebank dataset in NLTK. As far as I know, calling nltk.download('treebank') gives me only about 5% of the dataset. However, I have the complete dataset in a tar.gz file and I want to use that instead. The documentation says:

If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symlinks work as well). Then use the ptb module instead of treebank:

So I started Python from the terminal, imported nltk, and ran nltk.download('ptb'). This created a ptb directory under ~/nltk_data/corpora, so I now have ~/nltk_data/corpora/ptb. As the passage quoted above suggests, I put my dataset folder inside it. This is my final directory hierarchy:

    $: pwd
    $: ~/nltk_data/corpora/ptb/WSJ
    $: ls
    $:00  02  04  06  08  10  12  14  16  18  20  22  24
      01  03  05  07  09  11  13  15  17  19  21  23  merge.log

Each of the folders from 00 to 24 contains many .mrg files, such as wsj_0001.mrg, wsj_0002.mrg, and so on.

Now, back to my question. According to the same documentation, I should be able to obtain the file ids with the following:

    >>> from nltk.corpus import ptb
    >>> print(ptb.fileids()) # doctest: +SKIP
    ['BROWN/CF/CF01.MRG', 'BROWN/CF/CF02.MRG', 'BROWN/CF/CF03.MRG', 'BROWN/CF/CF04.MRG', ...]

Unfortunately, when I run print(ptb.fileids()) I get an empty list.

    >>> print(ptb.fileids())
    []

Can anyone help me?

EDIT: here are the contents of my ptb directory and part of the allcats.txt file:

    $: pwd
    $: ~/nltk_data/corpora/ptb
    $: ls
    $: allcats.txt  WSJ
    $: cat allcats.txt
    $: WSJ/00/WSJ_0001.MRG news
    WSJ/00/WSJ_0002.MRG news
    WSJ/00/WSJ_0003.MRG news
    WSJ/00/WSJ_0004.MRG news
    WSJ/00/WSJ_0005.MRG news

    and so on ..

448

asked Mar 18 '16 08:03

zwlayer

1 Answer

The PTB corpus reader needs uppercase directory and file names (as hinted by the contents of allcats.txt that you included in your question). This clashes with many distributions of Penn Treebank out there, which use lowercase.
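One way to confirm this kind of mismatch is to compare the names listed in allcats.txt with what is actually on disk. The following is a minimal sketch, assuming (as suggested above) that allcats.txt lists file ids in the form the reader expects. The helper names `exists_exact` and `missing_fileids` are made up for illustration and are not part of NLTK.

```python
import os


def exists_exact(root, relpath):
    """Return True if relpath exists under root with exactly matching case."""
    current = root
    for component in relpath.split("/"):
        try:
            names = os.listdir(current)
        except OSError:
            # current does not exist or is not a directory
            return False
        # Compare against the real directory listing, so the check stays
        # case-sensitive even on a case-insensitive file system.
        if component not in names:
            return False
        current = os.path.join(current, component)
    return True


def missing_fileids(ptb_root):
    """List the file ids from allcats.txt that are not found on disk."""
    missing = []
    with open(os.path.join(ptb_root, "allcats.txt")) as f:
        for line in f:
            fields = line.split()
            # Each non-empty line looks like "WSJ/00/WSJ_0001.MRG news"
            if fields and not exists_exact(ptb_root, fields[0]):
                missing.append(fields[0])
    return missing
```

Calling `missing_fileids(os.path.expanduser("~/nltk_data/corpora/ptb"))` on the layout from the question would be expected to report the WSJ entries, since the files on disk are named like wsj_0001.mrg rather than WSJ_0001.MRG.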

A quick fix is to rename the folders wsj and brown, and everything inside them, to uppercase. A UNIX command you can use for this is:

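    # Run from inside the directory whose contents should be renamed;
    # every file and directory below the current one is affected.
    # "! -name ." skips the current directory itself.
    find . -depth ! -name . |
        while IFS= read -r LONG
        do
            SHORT=$( basename "$LONG" | tr '[:lower:]' '[:upper:]' )
            DIR=$( dirname "$LONG" )
            # Rename only when the uppercase name differs from the current one
            if [ "${LONG}" != "${DIR}/${SHORT}" ]
            then
                mv "${LONG}" "${DIR}/${SHORT}"
            fi
        done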

(Taken from another question.) It recursively renames directories and files to uppercase.
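If you would rather stay in Python, roughly the same renaming can be sketched with `os.walk`. This is only an illustration and not part of NLTK: the function name `uppercase_tree` is made up, and the sketch assumes that no two names in the same directory differ only by case, because `os.rename` may silently replace an existing file.

```python
import os


def uppercase_tree(root):
    """Rename everything below root to uppercase; root itself is untouched.

    Returns the number of entries that were renamed.
    """
    renamed = 0
    # topdown=False yields the deepest directories first, so the contents of
    # a directory are renamed before the directory itself is.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            upper = name.upper()
            if upper != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, upper))
                renamed += 1
    return renamed
```

It is probably safest to call this on the WSJ and BROWN directories individually rather than on the ptb directory as a whole, since allcats.txt lives there and would be renamed too.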

160

answered Nov 01 '22 11:11

freieschaf
