
Python concordance command in NLTK

Tags: python, nlp, nltk

I have a question about the concordance command in NLTK for Python. First, I came across an easy example:

from nltk.book import *

text1.concordance("monstrous")

which worked just fine. Now I have my own .txt file and would like to run the same command on it. I have a list called "textList" and want to find the word "CNA", so I ran the command

textList.concordance('CNA') 

Yet, I got the error

AttributeError: 'list' object has no attribute 'concordance'. 

In the example, is text1 not a list? I wonder what is going on here.

Phaii asked Mar 17 '15 22:03


3 Answers

.concordance() is not a built-in Python function; it belongs to NLTK, so you can't call it on an arbitrary Python object (like your list).

More specifically, .concordance() is a method of NLTK's Text class. The text1 in your example is a Text object, not a list.

So to use .concordance(), you first have to instantiate a Text object and then call the method on that object.
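
To make the distinction concrete, here is a minimal sketch, assuming NLTK is installed. The token list is made up for illustration and stands in for the asker's textList; it checks that a plain list lacks the method while a Text built from the same tokens has it.

```python
from nltk.text import Text

# Hypothetical token list, standing in for the asker's textList.
tokens = ["The", "CNA", "report", "was", "filed", "by", "the", "CNA", "office"]

# A plain Python list knows nothing about NLTK methods.
list_has_method = hasattr(tokens, "concordance")

# Wrapping the same tokens in Text makes the method available.
text = Text(tokens)
text_has_method = hasattr(text, "concordance")

print(list_has_method, text_has_method)

# Prints the matches with their surrounding context.
text.concordance("CNA")
```

Building a Text from an in-memory token list needs no corpus download, which makes it a convenient way to experiment before reading a real file.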

Text

A Text is typically initialized from a given document or corpus. E.g.:

import nltk.corpus  
from nltk.text import Text  
moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

.concordance()

concordance(word, width=79, lines=25)

Print a concordance for word with the specified context window. Word matching is not case-sensitive.
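
As a rough illustration of what a concordance does (this is not NLTK's implementation, only a sketch under simplifying assumptions), the pure-Python function below finds case-insensitive matches in a token list and returns each one with a few tokens of context on either side. The name simple_concordance and its window parameter are hypothetical; window counts tokens, whereas NLTK's width is a line width in characters.

```python
def simple_concordance(tokens, word, window=3, lines=25):
    """Return up to `lines` keyword-in-context strings for `word`."""
    target = word.lower()
    results = []
    for i, tok in enumerate(tokens):
        # Case-insensitive comparison, mirroring the documented behaviour.
        if tok.lower() != target:
            continue
        left = " ".join(tokens[max(0, i - window):i])
        right = " ".join(tokens[i + 1:i + 1 + window])
        results.append(f"{left} [{tok}] {right}".strip())
        # Stop once the requested number of lines has been collected.
        if len(results) >= lines:
            break
    return results


sample = "The CNA report was filed late because the cna office closed".split()
hits = simple_concordance(sample, "CNA", window=2)
for line in hits:
    print(line)
```

Both "CNA" and "cna" are matched here, which is why searching a Text for 'CNA' can also return lowercase occurrences.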

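So I imagine something like this would work (not tested). Note that nltk.corpus.gutenberg.words() only reads files that ship with the Gutenberg corpus, so for your own file you need to read and tokenize it yourself:

from nltk.text import Text
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data

with open('YOUR FILE NAME HERE.txt') as f:
    tokens = word_tokenize(f.read())
textList = Text(tokens)
textList.concordance('CNA')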

Tim answered Nov 16 '22 18:11


I got it working with this code:

import sys
from nltk.tokenize import word_tokenize
from nltk.text import Text

def main():
    # Require a file path as the first command-line argument.
    if len(sys.argv) < 2:
        return
    # Read the whole file into one string.
    with open(sys.argv[1], "r") as f:
        text = f.read()
    # Split the string into word tokens and wrap them in an NLTK Text.
    tokens = word_tokenize(text)
    textList = Text(tokens)
    textList.concordance('is')
    print(tokens)


if __name__ == '__main__':
    main()

based on this site

ǝlpoodooɟƃuooʞ answered Nov 16 '22 19:11


In a Jupyter notebook (or a Google Colab notebook), the full process is: MS Word file --> text --> NLTK Text object:

from nltk.tokenize import word_tokenize
from nltk.text import Text

import docx2txt

# docx2txt.process() returns the document's text as a single string.
myTextFile = docx2txt.process("/mypath/myWordFile")
tokens = word_tokenize(myTextFile)
print(tokens)
textList = Text(tokens)
textList.concordance('contract')

NellieK answered Nov 16 '22 19:11