I am trying to create a term density matrix from a pandas dataframe, so I can rate terms appearing in the dataframe. I also want to be able to keep the 'spatial' aspect of my data (see comment at the end of post for an example of what I mean).
I am new to pandas and NLTK, so I expect my problem to be solvable with some existing tools.
I have a dataframe which contains two columns of interest, say 'title' and 'page':
import pandas as pd
import re
df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ','Split orange','Something else'], 'page':[1, 2, 3, 4]})
df.head()
   page                 title
0     1  Delicious boiled egg
1     2            Fried egg 
2     3          Split orange
3     4        Something else
My goal is to clean up the text and pass terms of interest to a TDM dataframe. I use two functions to help me clean up the strings:
import nltk.classify
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
import string
def remove_punct(strin):
    '''
    returns a string with the punctuation marks removed, and all lower case letters
    input: strin, an ascii string. convert using strin.encode('ascii','ignore') if it is unicode
    '''
    return strin.translate(string.maketrans("",""), string.punctuation).lower()
sw = stopwords.words('english')
def tok_cln(strin):
    '''
    tokenizes string and removes stopwords
    '''
    return set(nltk.wordpunct_tokenize(strin)).difference(sw)
And one function which does the dataframe manipulation:
def df2tdm(df,titleColumn,placementColumn,newPlacementColumn):
    '''
    takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
    of the words appearing in the titleColumn
    Inputs: df, a DataFrame containing titleColumn, placementColumn among others
    Outputs: tdm_df, a DataFrame containing newPlacementColumn and columns with all the terms in df[titleColumn]
    '''
    tdm_df = pd.DataFrame(index=df.index, columns=[newPlacementColumn])
    tdm_df = tdm_df.fillna(0)
    for idx in df.index:
        for word in tok_cln( remove_punct(df[titleColumn][idx].encode('ascii','ignore')) ):
            if word not in tdm_df.columns:
                newcol = pd.DataFrame(index = df.index, columns = [word])
                tdm_df = tdm_df.join(newcol)
            tdm_df[newPlacementColumn][idx] = df[placementColumn][idx]
            tdm_df[word][idx] = 1
    return tdm_df.fillna(0,inplace = False)
tdm_df = df2tdm(df,'title','page','pub_page')
tdm_df.head()
This returns
   pub_page  boiled  egg  delicious  fried  orange  split  something  else
0         1       1    1          1      0       0      0          0     0
1         2       0    1          0      1       0      0          0     0
2         3       0    0          0      0       1      1          0     0
3         4       0    0          0      0       0      0          1     1
But it is painfully slow when parsing large sets (output of hundreds of thousands of rows and thousands of columns). My two questions:
Can I speed up this implementation?
Is there some other tool I could use to get this done?
I want to be able to keep the 'spatial' aspect of my data. For example, if 'egg' appears very often on pages 1-10 and then reappears often on pages 500-520, I want to know that.
You can use scikit-learn's CountVectorizer:
In [14]: from sklearn.feature_extraction.text import CountVectorizer
In [15]: countvec = CountVectorizer()
In [16]: countvec.fit_transform(df.title)
Out[16]:
<4x8 sparse matrix of type '<type 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Column format>
It returns the term-document matrix in a sparse representation because such a matrix is usually huge and, well, sparse.
For your particular example, I guess converting it back to a DataFrame would still work:
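If the full matrix is too large to densify, you can often rate terms directly on the sparse result. Below is a minimal sketch of that idea, assuming a recent scikit-learn and NumPy; the names `titles`, `X`, `term_totals` and `totals_by_term` are only illustrative and do not come from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Same four titles as in the question (plain list instead of df.title)
titles = ['Delicious boiled egg', 'Fried egg ', 'Split orange', 'Something else']

countvec = CountVectorizer()
X = countvec.fit_transform(titles)  # sparse matrix: one row per title, one column per term

# Summing over rows gives the total count of each term without building a dense matrix
term_totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index in X
totals_by_term = {term: int(term_totals[col])
                  for term, col in countvec.vocabulary_.items()}

print(totals_by_term['egg'])
```

Only the per-term totals are materialized here, so memory use grows with the number of terms rather than with rows times terms.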
In [17]: pd.DataFrame(countvec.fit_transform(df.title).toarray(), columns=countvec.get_feature_names())
Out[17]:
   boiled  delicious  egg  else  fried  orange  something  split
0       1          1    1     0      0       0          0      0
1       0          0    1     0      1       0          0      0
2       0          0    0     0      0       1          0      1
3       0          0    0     1      0       0          1      0
[4 rows x 8 columns]
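As for keeping the 'spatial' aspect: the rows of the matrix stay in the same order as the input titles, so one option is to pull out a single term's column and index it by page. This is only a sketch under that assumption; the bin width of 2 pages and the names `egg_per_page`, `page_bin` and `egg_per_bin` are made up for the toy data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'title': ['Delicious boiled egg', 'Fried egg ',
                             'Split orange', 'Something else'],
                   'page': [1, 2, 3, 4]})

countvec = CountVectorizer()
X = countvec.fit_transform(df['title'])  # row i corresponds to df row i

# Densify only the column for the term of interest and index it by page
egg_col = countvec.vocabulary_['egg']
egg_per_page = pd.Series(X[:, egg_col].toarray().ravel(),
                         index=df['page'], name='egg')

# Group pages into bins of 2 (pages 1-2 -> bin 0, pages 3-4 -> bin 1)
page_bin = (df['page'].to_numpy() - 1) // 2
egg_per_bin = egg_per_page.groupby(page_bin).sum()

print(egg_per_bin)
```

With real data you would pick a wider bin, so that a burst of 'egg' on pages 1-10 and another on pages 500-520 show up as separate peaks.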