I have a large text file (chat.txt) with over 1GB of chat data in the following format:
john|12-02-1999|hello#,there#,how#,are#,you#,tom$
tom|12-02-1999|hey#,john$,hows#, it#, goin#
mary|12-03-1999|hello#,boys#,fancy#,meetin#,ya'll#,here#
...
...
john|12-02-2000|well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$
mary|12-03-2000|catch#,you#,on#,the#,flipside#,tom$,and#,john$
I want to process this text and summarize the word counts for certain keywords (say 500 words: hello, nice, like, ..., dinner, no) for each user separately. The process also involves removing the trailing special characters from each word.
The output would look like this:
user hello nice like ..... dinner No
Tom 10000 500 300 ..... 6000 0
John 6000 1200 200 ..... 3000 5
Mary 23 9000 10000 ..... 100 9000
This is my current pythonic solution:
import pandas as pd
from collections import Counter

chat_data = pd.read_csv("chat.txt", sep="|", names=["user", "date", "words"])
user_lst = chat_data.user.unique()

# Collect every word string belonging to each user into one comma-separated string
user_grouped_data = pd.DataFrame(columns=["user", "words"])
user_grouped_data["user"] = user_lst
for i, row in user_grouped_data.iterrows():
    id = row["user"]
    temp = chat_data[chat_data["user"] == id]
    user_grouped_data.loc[i, "words"] = ",".join(temp["words"].tolist())

# Count the keywords per user after stripping the trailing special character
result = pd.DataFrame(columns=["user", "hello", "nice", "like", "...500 other keywords...", "dinner", "no"])
result["user"] = user_lst
for i, row in result.iterrows():
    id = row["user"]
    temp = user_grouped_data[user_grouped_data["user"] == id]
    words = temp.values.tolist()[0][1]
    word_lst = words.split(",")
    word_lst = [item[0:-1] for item in word_lst]
    t_dict = Counter(word_lst)
    keys = t_dict.keys()
    for word in keys:
        result.at[i, word] = t_dict.get(word)

result.to_csv("user_word_counts.csv")
This works fine for small data, but once chat.txt grows beyond 1GB this solution becomes very slow and unusable.
Is there any part of this code that I can improve to help me process the data faster?
For a Pandas DataFrame, a basic idea would be to divide the DataFrame into a few pieces, as many pieces as you have CPU cores, and let each core run the calculation on its own piece. At the end, we can aggregate the results, which is a computationally cheap operation; this is how a multi-core system can process data faster. By combining the ease of coding in Python with the efficiency of C-backed, vectorized pandas operations, you can achieve considerable speed increases with relative ease.
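As a rough sketch of that idea (the chunk size, the short keyword list, and the helper names here are placeholders, not code from the question): read the file in chunks, count the keywords per user in each chunk on a worker process, and merge the per-chunk counters at the end.

import pandas as pd
from collections import Counter
from multiprocessing import Pool

KEYWORDS = ["hello", "nice", "like", "dinner", "no"]  # stand-in for the ~500 keywords
KEYWORD_SET = set(KEYWORDS)                           # fast membership tests

def count_chunk(chunk):
    # Count cleaned keywords per user within one chunk of the file
    counts = {}
    for user, words in zip(chunk["user"], chunk["words"]):
        cleaned = (w.strip().rstrip("#$") for w in words.split(","))
        counts.setdefault(user, Counter()).update(w for w in cleaned if w in KEYWORD_SET)
    return counts

def merge(partials):
    # Aggregating the per-chunk counters is cheap compared to the counting itself
    total = {}
    for partial in partials:
        for user, counter in partial.items():
            total.setdefault(user, Counter()).update(counter)
    return total

if __name__ == "__main__":
    reader = pd.read_csv("chat.txt", sep="|", names=["user", "date", "words"],
                         chunksize=500_000)       # stream the 1GB file in pieces
    with Pool() as pool:                          # one worker per CPU core by default
        totals = merge(pool.imap_unordered(count_chunk, reader))
    result = pd.DataFrame(totals).T.reindex(columns=KEYWORDS, fill_value=0).fillna(0).astype(int)
    result.index.name = "user"
    result.to_csv("user_word_counts.csv")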
If you need to loop through a Pandas DataFrame, you should almost always use itertuples instead of iterrows, or use the apply method rather than a plain loop to apply a function to each row of the DataFrame.
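For example, a quick sketch of the same kind of row loop written three ways, using the chat_data frame from the question (counting '#' characters is just a placeholder operation):

# Slow: iterrows materializes each row as a Series
for i, row in chat_data.iterrows():
    n = row["words"].count("#")

# Faster: itertuples yields lightweight namedtuples
for row in chat_data.itertuples(index=False):
    n = row.words.count("#")

# Faster still: apply the function to the column directly
hash_counts = chat_data["words"].apply(lambda s: s.count("#"))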
A plain Pandas DataFrame is stored as one block and is only processed by one CPU core. A Modin DataFrame, by contrast, is partitioned across rows and columns, and each partition can be sent to a different CPU core, up to the maximum number of cores in the system.
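If you want to try that route, Modin is designed as a drop-in replacement for pandas (assuming it is installed, e.g. with pip install "modin[ray]"), so swapping the import is usually the only change needed; a minimal sketch:

import modin.pandas as pd  # same API as pandas, but partitioned across all CPU cores

chat_data = pd.read_csv("chat.txt", sep="|", names=["user", "date", "words"])
print(chat_data.groupby("user").size())  # the groupby runs on the partitions in parallel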
You can split the comma-separated column to a list, explode the dataframe by that column of lists, groupby the name and the values from the exploded list, unstack or pivot_table the dataframe into your desired format, and do some final cleaning on the multi-index columns with droplevel(), reset_index(), etc.
All of the below are vectorized pandas methods, so hopefully it is quick. Note: the three columns are [0, 1, 2] in the code below because I read the sample from the clipboard and passed header=None.
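(For the real chat.txt you would get the same integer column labels by passing header=None to read_csv, for example:)

import pandas as pd

# header=None keeps the three fields labelled 0, 1, 2 instead of using the first row as a header
df = pd.read_csv("chat.txt", sep="|", header=None)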
Input:
df = pd.DataFrame({0: {0: 'john', 1: 'tom', 2: 'mary', 3: 'john', 4: 'mary'},
1: {0: '12-02-1999',
1: '12-02-1999',
2: '12-03-1999',
3: '12-02-2000',
4: '12-03-2000'},
2: {0: 'hello#,there#,how#,are#,you#,tom$ ',
1: 'hey#,john$,hows#, it#, goin#',
2: "hello#,boys#,fancy#,meetin#,ya'll#,here#",
3: 'well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$',
4: 'catch#,you#,on#,the#,flipside#,tom$,and#,john$'}})
Code:
# Strip the trailing '#' / '$' and split each row into a list of words
df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')

df = (df.explode(2)                      # one row per word
        .groupby([0, 2])[2].count()      # count each word per user
        .rename('Count')
        .reset_index()
        .set_index([0, 2])
        .unstack(1)                      # words become columns
        .fillna(0))
df.columns = df.columns.droplevel()      # drop the 'Count' level from the column MultiIndex
df = df.reset_index()
df
Out[1]:
2 0 goin it mary and are been boys catch catching ... on \
0 john 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 ... 0.0
1 mary 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 ... 1.0
2 tom 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
2     the  there  tom  tom   up  well  with  ya'll  you
0     0.0    1.0  0.0  1.0  1.0   1.0   1.0    0.0  2.0
1     1.0    0.0  1.0  0.0  0.0   0.0   0.0    1.0  1.0
2     0.0    0.0  0.0  0.0  0.0   0.0   0.0    0.0  0.0

[3 rows x 31 columns]
You could also use .pivot_table instead of .unstack(), which saves you this line of code: df.columns = df.columns.droplevel():
df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')

df = (df.explode(2)
        .groupby([0, 2])[2].count()
        .rename('Count')
        .reset_index()
        .pivot_table(index=0, columns=2, values='Count')
        .fillna(0)
        .astype(int)
        .reset_index())
df
Out[45]:
2 0 goin it mary and are been boys catch catching ... on \
0 john 0 0 1 1 1 1 0 0 1 ... 0
1 mary 0 0 0 1 0 0 1 1 0 ... 1
2 tom 1 1 0 0 0 0 0 0 0 ... 0
2 the there tom tom up well with ya'll you
0 0 1 0 1 1 1 1 0 2
1 1 0 1 0 0 0 0 1 1
2 0 0 0 0 0 0 0 0 0
[3 rows x 31 columns]