
Python text processing: NLTK and pandas

I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.

I have some text data with a few other attributes. I would like to run some analyses on the text and I would like to be able to correlate features extracted from text (such as individual word tokens or LDA topics) with the other attributes.

My plan was to load the data as a pandas DataFrame, with each response representing one document. Unfortunately, I ran into an issue:

import pandas as pd
import nltk

pd.options.display.max_colwidth = 10000

txt_data = pd.read_csv("data_file.csv",sep="|")
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581 

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45

txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
    txt_lines.append(line)

txt = str(txt_lines)
len(txt)
Out[14]: 1668813

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086

Note that in both cases the text was preprocessed so that everything except spaces, letters and ,.?! was removed (for simplicity).

As you can see, a pandas field converted into a string returns fewer matches, and the resulting string is also much shorter.

Is there any way to improve the above code?

Also, str(x) creates one big string out of the comments, while [str(x) for x in txt_data.comment] creates a list object which cannot be broken into a bag of words. What is the best way to produce an nltk.Text object that will retain document indices? In other words, I'm looking for a way to create a Term Document Matrix, the equivalent of R's TermDocumentMatrix() from the tm package.

Many thanks.

asked Jan 14 '16 07:01 by IVR



1 Answer

The benefit of using a pandas DataFrame would be to apply the nltk functionality to each row like so:

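import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize

# Build 50 dummy documents of 1000 words each, drawn from 40 dictionary words
word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]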

df = pd.DataFrame(random_word_list, columns=['text'])
df.head()

                                                text
0  Aaru Aaronic abandonable abandonedly abaction ...
1  abampere abampere abacus aback abalone abactor...
2  abaisance abalienate abandonedly abaff abacina...
3  Ababdeh abalone abac abaiser abandonable abact...
4  abandonable abandon aba abaiser abaft Abama ab...

len(df)

50

txt = df.text.apply(word_tokenize)
txt.head()

0    [Aaru, Aaronic, abandonable, abandonedly, abac...
1    [abampere, abampere, abacus, aback, abalone, a...
2    [abaisance, abalienate, abandonedly, abaff, ab...
3    [Ababdeh, abalone, abac, abaiser, abandonable,...
4    [abandonable, abandon, aba, abaiser, abaft, Ab...

txt.apply(len)

0     1000
1     1000
2     1000
3     1000
4     1000
....
44    1000
45    1000
46    1000
47    1000
48    1000
49    1000
Name: text, dtype: int64
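Every row is tokenized in full here. By contrast, str(txt_data.comment) in the question most likely returned the display form of the Series, which pandas shortens once the number of rows exceeds pd.options.display.max_rows (60 by default). That would explain the shorter string and the lower count. A minimal sketch of the difference, using a made-up Series named comments as a stand-in for txt_data.comment:

```python
import pandas as pd

# Hypothetical stand-in for txt_data.comment: 200 short comments,
# each containing the word "the" exactly once.
comments = pd.Series([f"comment number {i} mentions the word" for i in range(200)])

# str() on a Series returns its printed representation. With more rows
# than display.max_rows, only the first and last few rows are included.
with pd.option_context("display.max_rows", 60):
    as_displayed = str(comments)

# Joining the values keeps the text of every row.
as_joined = " ".join(comments.astype(str))

print(len(as_displayed), len(as_joined))
print(as_displayed.count("the"), as_joined.count("the"))
```

Working on the column values row by row, as above, avoids the display truncation altogether.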

As a result, you get the .count() for each row entry:

txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()

0    27
1    24
2    17
3    25
4    32

You can then sum the result using:

txt.sum()

1239
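
The same per-row idea can be extended from one word to all words, which gives something close to the TermDocumentMatrix() the question asks about. The following is only a sketch under assumptions: the column names comment and rating are invented, and plain whitespace splitting stands in for nltk.word_tokenize so the example runs without NLTK data. Documents are rows here (a document-term matrix); transposing with .T gives the term-document orientation.

```python
from collections import Counter

import pandas as pd

# Hypothetical data: one comment per row plus one extra attribute.
df = pd.DataFrame({
    "comment": ["the cat sat", "the dog sat on the mat", "a cat and a dog"],
    "rating": [5, 3, 4],
})

# Whitespace splitting is a simplification; nltk.word_tokenize could be
# applied here instead, as in the answer above.
tokens = df["comment"].str.lower().str.split()

# One Counter per document. pandas aligns the counters into one column
# per term; terms missing from a document become NaN, then 0.
dtm = pd.DataFrame([Counter(doc) for doc in tokens], index=df.index)
dtm = dtm.fillna(0).astype(int)

# The matrix shares df's index, so term counts can be joined to the
# other attributes and correlated with them.
combined = df.join(dtm.add_prefix("term_"))
print(combined)
```

Because the document index is preserved, a call such as combined.corr(numeric_only=True) can then relate term counts to the numeric attributes. For larger corpora a sparse representation would probably be a better fit than a dense DataFrame.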
answered Nov 15 '22 16:11 by Stefan