I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.
I have some text data with a few other attributes. I would like to run some analyses on the text and be able to correlate features extracted from it (such as individual word tokens or LDA topics) with the other attributes.
My plan was to load the data as a pandas DataFrame, with each response representing a document. Unfortunately, I ran into an issue:
import pandas as pd
import nltk
pd.options.display.max_colwidth = 10000
txt_data = pd.read_csv("data_file.csv",sep="|")
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45
txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
    txt_lines.append(line)
txt = str(txt_lines)
len(txt)
Out[14]: 1668813
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086
Note that in both cases, the text was preprocessed so that anything other than spaces, letters and ,.?! was removed (for simplicity).
As you can see, the pandas field converted into a string returns fewer matches, and the length of the string is also shorter.
Is there any way to improve the above code?
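(As a rough diagnostic, and I may be wrong about this: it looks like str() on a pandas Series returns its truncated display repr rather than the full contents, so joining the values explicitly keeps the whole text:)
# str(txt_data.comment) only gives the Series' display repr, which elides long
# cells and rows; joining the values explicitly keeps the full text
full_txt = ' '.join(str(x) for x in txt_data.comment)
len(full_txt)   # now comparable to the length of the raw file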
Also, str(x) creates one big string out of the comments, while [str(x) for x in txt_data.comment] creates a list object which cannot be broken into a bag of words. What is the best way to produce an nltk.Text object that retains document indices? In other words, I'm looking for a way to create a Term Document Matrix, the equivalent of R's TermDocumentMatrix() from the tm package.
Many thanks.
The benefit of using a pandas DataFrame would be to apply the nltk functionality to each row, like so:
import nltk
import numpy as np
import pandas as pd
from nltk import word_tokenize

# toy corpus: 50 "documents" of 1000 random dictionary words each
word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]
df = pd.DataFrame(random_word_list, columns=['text'])
df.head()
text
0 Aaru Aaronic abandonable abandonedly abaction ...
1 abampere abampere abacus aback abalone abactor...
2 abaisance abalienate abandonedly abaff abacina...
3 Ababdeh abalone abac abaiser abandonable abact...
4 abandonable abandon aba abaiser abaft Abama ab...
len(df)
50
txt = df.text.apply(word_tokenize)
txt.head()
0 [Aaru, Aaronic, abandonable, abandonedly, abac...
1 [abampere, abampere, abacus, aback, abalone, a...
2 [abaisance, abalienate, abandonedly, abaff, ab...
3 [Ababdeh, abalone, abac, abaiser, abandonable,...
4 [abandonable, abandon, aba, abaiser, abaft, Ab...
txt.apply(len)
0 1000
1 1000
2 1000
3 1000
4 1000
....
44 1000
45 1000
46 1000
47 1000
48 1000
49 1000
Name: text, dtype: int64
As a result, you get the .count() for each row entry:
txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()
0 27
1 24
2 17
3 25
4 32
You can then sum the result using:
txt.sum()
1239
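If you then want a full Term Document Matrix (the analogue of R's TermDocumentMatrix() from the tm package), one possible sketch, reusing the df built above, is to count tokens per row with collections.Counter and let pandas align the columns; this is just one way to do it, not the only one:
from collections import Counter

# one row per document, one column per term; terms absent from a document become 0
dtm = pd.DataFrame([Counter(word_tokenize(doc)) for doc in df.text]).fillna(0).astype(int)

# R's TermDocumentMatrix() puts terms in rows, so transpose for that layout
tdm = dtm.T

dtm['abac'].head()   # per-document counts, indexed the same way as df
Because dtm shares df's index, it can be joined back onto the other columns (for example df.join(dtm)), so the term counts can be correlated with the remaining attributes.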