Creating a new corpus with NLTK

I know the usual answer to a question like this is to go and read the documentation, but I went through the NLTK book and it doesn't give the answer. I'm fairly new to Python.

I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for the corpora in nltk_data.

I've tried PlaintextCorpusReader but I couldn't get further than:

    >>> import nltk
    >>> from nltk.corpus import PlaintextCorpusReader
    >>> corpus_root = './'
    >>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
    >>> newcorpus.words()

How do I segment the sentences in newcorpus using punkt? I tried the punkt functions, but they couldn't read the PlaintextCorpusReader class.

Can you also point me to how I can write the segmented data into text files?

474

asked Feb 09 '11 23:02

alvas

1 Answer

After some years of figuring out how it works, here's an updated tutorial on:

How to create an NLTK corpus with a directory of textfiles?

The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it's best to use the PlaintextCorpusReader.

If you have a directory that looks like this:

    newcorpus/
        file1.txt
        file2.txt
        ...

Simply use these lines of code and you can get a corpus:

    import os
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    corpusdir = 'newcorpus/'  # Directory of corpus.

    newcorpus = PlaintextCorpusReader(corpusdir, '.*')
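The second argument is a regular expression over file names (an explicit list of names also works), so '.*' picks up every file under the directory, not just the text files. A minimal sketch of restricting the corpus to .txt files, assuming a scratch directory and made-up file names that are not part of the original answer:

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Hypothetical scratch directory and file names, for illustration only.
corpusdir = tempfile.mkdtemp()
samples = [
    ("file1.txt", "First text.\n"),
    ("file2.txt", "Second text.\n"),
    ("notes.md", "Not meant to be part of the corpus.\n"),
]
for name, text in samples:
    with open(os.path.join(corpusdir, name), "w", encoding="utf-8") as fout:
        fout.write(text)

# Only file names that match the pattern become fileids.
txt_only = PlaintextCorpusReader(corpusdir, r".*\.txt")
print(txt_only.fileids())

# '.*' matches every file in the directory, including notes.md.
everything = PlaintextCorpusReader(corpusdir, ".*")
print(everything.fileids())
```

This matters most when the corpus root is a working directory such as './', where scripts and other non-text files would otherwise end up in the corpus.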

NOTE: By default, the PlaintextCorpusReader splits sentences with NLTK's pre-trained English Punkt model (the same model that nltk.tokenize.sent_tokenize() uses for English) and splits words with WordPunctTokenizer. These defaults are built for English and may NOT work for all languages.

Here's the full code, which creates test text files, builds a corpus with NLTK, and accesses the corpus at different levels:

    import os
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    # Let's create a corpus with 2 texts in different text files.
    txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
    txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
    corpus = [txt1, txt2]

    # Make a new directory for the corpus.
    corpusdir = 'newcorpus/'
    if not os.path.isdir(corpusdir):
        os.mkdir(corpusdir)

    # Write the files into the directory as 1.txt, 2.txt, ...
    for filename, text in enumerate(corpus, start=1):
        with open(corpusdir + str(filename) + '.txt', 'w') as fout:
            print(text, file=fout)

    # Check that our corpus exists and the files are correct.
    assert os.path.isdir(corpusdir)
    for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
        with open(corpusdir + infile, 'r') as fin:
            assert fin.read().strip() == text.strip()

    # Create a new corpus by specifying the parameters
    # (1) directory of the new corpus
    # (2) the fileids of the corpus
    # NOTE: in this case the fileids are simply the filenames.
    newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

    # Access each file in the corpus.
    for infile in sorted(newcorpus.fileids()):
        print(infile)  # The fileid of each file.
        with newcorpus.open(infile) as fin:  # Opens the file.
            print(fin.read().strip())  # Prints the content of the file.
    print()

    # Access the plaintext; returns a plain string (str).
    print(newcorpus.raw().strip())
    print()

    # Access paragraphs in the corpus. (list of list of list of strings)
    # NOTE: NLTK tokenizes sentences and words automatically here, using the
    #       reader's tokenizers. paras() and sents() need the Punkt sentence
    #       model to be installed (it can be fetched with nltk.download()).
    #
    # Each element in the outermost list is a paragraph, and
    # each paragraph contains sentence(s), and
    # each sentence contains token(s).
    print(newcorpus.paras())
    print()

    # To access paragraphs of a specific fileid.
    print(newcorpus.paras(newcorpus.fileids()[0]))

    # Access sentences in the corpus. (list of list of strings)
    # NOTE: The texts are flattened into sentences that contain tokens.
    print(newcorpus.sents())
    print()

    # To access sentences of a specific fileid.
    print(newcorpus.sents(newcorpus.fileids()[0]))

    # Access just tokens/words in the corpus. (list of strings)
    print(newcorpus.words())

    # To access tokens of a specific fileid.
    print(newcorpus.words(newcorpus.fileids()[0]))
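The question also asks how to write the segmented data into text files. A minimal sketch, assuming that one sentence per line with tokens joined by single spaces is an acceptable output format. The helper write_sentences, the sample data, and the output file name are hypothetical and not part of NLTK:

```python
import os
import tempfile


def write_sentences(sentences, path):
    """Write one sentence per line, with tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as fout:
        for tokens in sentences:
            fout.write(" ".join(tokens) + "\n")


# Sample data in the same shape that a corpus reader's sents() yields:
# an iterable of sentences, each of which is a list of token strings.
sample_sents = [
    ["This", "is", "a", "foo", "bar", "sentence", "."],
    ["And", "this", "is", "the", "first", "txtfile", "in", "the", "corpus", "."],
]

# Hypothetical output location; any existing, writable directory works.
outdir = tempfile.mkdtemp()
outpath = os.path.join(outdir, "1.sents.txt")
write_sentences(sample_sents, outpath)

# Read the file back to show what was written.
with open(outpath, encoding="utf-8") as fin:
    written = fin.read()
print(written)
```

With a corpus reader in hand, the same helper could be called once per file, for example write_sentences(newcorpus.sents(fileid), outpath) inside a loop over newcorpus.fileids(). Joining tokens with spaces does not reproduce the original spacing around punctuation, so keep the source files if the exact text matters.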

Finally, to read a directory of texts and create an NLTK corpus in another language, you must first make sure that you have Python-callable word tokenization and sentence tokenization modules that take a string as input and produce output like this:

    >>> from nltk.tokenize import sent_tokenize, word_tokenize
    >>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
    >>> sent_tokenize(txt1)
    ['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
    >>> word_tokenize(sent_tokenize(txt1)[0])
    ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
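Once such tokenizers exist, they can be handed to the reader through its word_tokenizer and sent_tokenizer keyword arguments. A minimal sketch, assuming crude regex-based tokenizers as stand-ins for real language-specific ones. The patterns, the scratch directory, and the file name are illustrative assumptions only:

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-ins for language-specific tokenizers: the reader only
# needs objects with a tokenize(text) method that returns a list of strings.
sent_tokenizer = RegexpTokenizer(r"[^.!?]+[.!?]")  # text up to ., ! or ?
word_tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")   # words or single punctuation marks

# Scratch corpus with a single file (directory and file name are made up).
corpusdir = tempfile.mkdtemp()
with open(os.path.join(corpusdir, "1.txt"), "w", encoding="utf-8") as fout:
    fout.write("Are you a foo bar? Yes I am.\n")

custom = PlaintextCorpusReader(
    corpusdir,
    r".*\.txt",
    word_tokenizer=word_tokenizer,
    sent_tokenizer=sent_tokenizer,
)

# Materialize the lazy corpus views into plain lists for inspection.
sents = [list(sent) for sent in custom.sents()]
words = list(custom.words())
print(sents)
print(words)
```

Because the sentence tokenizer is supplied explicitly, this sketch does not rely on the English Punkt model. For a real corpus, replace the regex tokenizers with ones that suit the language.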

181

answered Sep 29 '22 15:09

alvas

Tags: python, nlp, nltk, corpus
