I'm trying to learn how to use the NLTK package in Python. In particular, I need to use the Penn Treebank dataset in NLTK. As far as I know, if I call nltk.download('treebank'), I get about 5% of the dataset. However, I have the complete dataset in a tar.gz file and I want to use that instead. It is said here that:
If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symlinks work as well). Then use the ptb module instead of treebank:
So I opened Python from a terminal, imported nltk, and typed nltk.download('ptb'). This command created a "ptb" directory under my ~/nltk_data directory, so I now have a ~/nltk_data/ptb directory. Inside it, as suggested in the link I gave above, I put my dataset folder. This is my final directory hierarchy:
$: pwd
~/nltk_data/corpora/ptb/WSJ
$: ls
00 02 04 06 08 10 12 14 16 18 20 22 24
01 03 05 07 09 11 13 15 17 19 21 23 merge.log
Each of the folders from 00 to 24 contains many .mrg files, such as wsj_0001.mrg, wsj_0002.mrg, and so on.
Now, back to my question. Again, according to the same page, I should be able to obtain the file IDs by writing the following:
>>> from nltk.corpus import ptb
>>> print(ptb.fileids()) # doctest: +SKIP
['BROWN/CF/CF01.MRG', 'BROWN/CF/CF02.MRG', 'BROWN/CF/CF03.MRG', 'BROWN/CF/CF04.MRG', ...]
Unfortunately, when I type print(ptb.fileids()), I get an empty list:
>>> print(ptb.fileids())
[]
Can anyone help me?
EDIT: Here are the contents of my ptb directory and part of the allcats.txt file:
$: pwd
~/nltk_data/corpora/ptb
$: ls
allcats.txt WSJ
$: cat allcats.txt
WSJ/00/WSJ_0001.MRG news
WSJ/00/WSJ_0002.MRG news
WSJ/00/WSJ_0003.MRG news
WSJ/00/WSJ_0004.MRG news
WSJ/00/WSJ_0005.MRG news
and so on ...
The PTB corpus reader needs uppercase directory and file names (as hinted by the contents of the allcats.txt file that you included in your question). This clashes with many distributions of the Penn Treebank, which use lowercase names.
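To check whether this is what is happening on your machine, a small sketch like the following lists every entry under a corpus folder whose name is not already uppercase. The helper name find_non_uppercase and the path in the usage comment are illustrative assumptions taken from the question, not part of NLTK.

```python
import os


def find_non_uppercase(root):
    """Return sorted paths (relative to root) whose last component is not all uppercase."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # Names made only of digits (e.g. "00") are equal to their uppercase form.
            if name != name.upper():
                full_path = os.path.join(dirpath, name)
                hits.append(os.path.relpath(full_path, root))
    return sorted(hits)


# Example usage (path assumed from the question):
# for path in find_non_uppercase(os.path.expanduser("~/nltk_data/corpora/ptb/WSJ")):
#     print(path)
```

If this prints entries such as 00/wsj_0001.mrg, the lowercase names are a likely reason for the empty file list.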
A quick fix is to rename the folders wsj and brown, along with their contents, to uppercase. A UNIX command you can use for this is:
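# Walk depth-first so that children are renamed before their parent directories.
find . -depth | \
while IFS= read -r LONG
do
    # Uppercase only the last path component.
    SHORT=$( basename "$LONG" | tr '[:lower:]' '[:upper:]' )
    DIR=$( dirname "$LONG" )
    # Rename only if the name actually changes.
    if [ "${LONG}" != "${DIR}/${SHORT}" ]
    then
        mv "${LONG}" "${DIR}/${SHORT}"
    fi
done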
(Obtained from this question.) It recursively renames every directory and file below the current directory to uppercase.
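If a shell is not handy, roughly the same renaming can be sketched in Python with os.walk. This is a minimal sketch rather than anything NLTK provides: uppercase_tree is a hypothetical helper name, the path in the usage comment is assumed from the question, and the code does not guard against two names colliding once uppercased.

```python
import os


def uppercase_tree(root):
    """Rename every file and directory below root to uppercase; return how many were renamed."""
    renamed = 0
    # topdown=False visits the deepest directories first (like `find -depth`),
    # so a directory is renamed only after its contents have been handled.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            upper = name.upper()
            if name != upper:
                os.rename(os.path.join(dirpath, name), os.path.join(dirpath, upper))
                renamed += 1
    return renamed


# Example usage (path assumed from the question; root itself is not renamed):
# uppercase_tree(os.path.expanduser("~/nltk_data/corpora/ptb/WSJ"))
```

Pointing it at the WSJ (or BROWN) folder rather than at ptb itself leaves allcats.txt untouched. Afterwards, running print(ptb.fileids()) again in a fresh Python session is a reasonable way to check whether the files are now picked up.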