I have a large text file (chat.txt) with over 1GB of chat data in the following format:
john|12-02-1999|hello#,there#,how#,are#,you#,tom$
tom|12-02-1999|hey#,john$,hows#, it#, goin#
mary|12-03-1999|hello#,boys#,fancy#,meetin#,ya'll#,here#
...
...
john|12-02-2000|well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$
mary|12-03-2000|catch#,you#,on#,the#,flipside#,tom$,and#,john$
I want to process this text and summarize the word counts for certain keywords (say 500 words: hello, nice, like, ..., dinner, no) for each user separately. The process also involves removing the trailing special characters from each word.
The output would look like this:
user hello nice like ..... dinner No
Tom 10000 500 300 ..... 6000 0
John 6000 1200 200 ..... 3000 5
Mary 23 9000 10000 ..... 100 9000
This is my current pythonic solution:
import pandas as pd
from collections import Counter

chat_data = pd.read_csv("chat.txt", sep="|", names=["user", "date", "words"])
user_lst = chat_data.user.unique()

# Collect every word string belonging to each user into one comma-separated string
user_grouped_data = pd.DataFrame(columns=["user", "words"])
user_grouped_data["user"] = user_lst
for i, row in user_grouped_data.iterrows():
    id = row["user"]
    temp = chat_data[chat_data["user"] == id]
    user_grouped_data.loc[i, "words"] = ",".join(temp["words"].tolist())

# Count the keywords per user after stripping the trailing special character
result = pd.DataFrame(columns=["user", "hello", "nice", "like", "...500 other keywords...", "dinner", "no"])
result["user"] = user_lst
for i, row in result.iterrows():
    id = row["user"]
    temp = user_grouped_data[user_grouped_data["user"] == id]
    words = temp.values.tolist()[0][1]
    word_lst = words.split(",")
    word_lst = [item[0:-1] for item in word_lst]
    t_dict = Counter(word_lst)
    keys = t_dict.keys()
    for word in keys:
        result.at[i, word] = t_dict.get(word)

result.to_csv("user_word_counts.csv")
This works fine for small data, but once chat.txt grows beyond 1GB this solution becomes very slow and unusable.
Is there any part of this code that I can improve to help me process the data faster?
For a Pandas DataFrame, a basic idea would be to divide the DataFrame into a few pieces, as many pieces as you have CPU cores, and let each core run the calculation on its own piece. At the end, we can aggregate the results, which is a computationally cheap operation; this is how a multi-core system can process data faster. By combining the ease of coding in Python with the efficiency of C-backed, vectorized pandas operations, you can achieve considerable speed increases with relative ease.
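As a rough sketch of that idea (the chunk size, the short keyword list, and the helper names here are placeholders, not code from the question): read the file in chunks, count the keywords per user in each chunk on a worker process, and merge the per-chunk counters at the end.

import pandas as pd
from collections import Counter
from multiprocessing import Pool

KEYWORDS = ["hello", "nice", "like", "dinner", "no"]  # stand-in for the ~500 keywords
KEYWORD_SET = set(KEYWORDS)                           # fast membership tests

def count_chunk(chunk):
    # Count cleaned keywords per user within one chunk of the file
    counts = {}
    for user, words in zip(chunk["user"], chunk["words"]):
        cleaned = (w.strip().rstrip("#$") for w in words.split(","))
        counts.setdefault(user, Counter()).update(w for w in cleaned if w in KEYWORD_SET)
    return counts

def merge(partials):
    # Aggregating the per-chunk counters is cheap compared to the counting itself
    total = {}
    for partial in partials:
        for user, counter in partial.items():
            total.setdefault(user, Counter()).update(counter)
    return total

if __name__ == "__main__":
    reader = pd.read_csv("chat.txt", sep="|", names=["user", "date", "words"],
                         chunksize=500_000)       # stream the 1GB file in pieces
    with Pool() as pool:                          # one worker per CPU core by default
        totals = merge(pool.imap_unordered(count_chunk, reader))
    result = pd.DataFrame(totals).T.reindex(columns=KEYWORDS, fill_value=0).fillna(0).astype(int)
    result.index.name = "user"
    result.to_csv("user_word_counts.csv")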
If you need to loop through a Pandas DataFrame, you should almost always use itertuples instead of iterrows, or use the apply method rather than a plain loop to apply a function to each row of the DataFrame.
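For example, a quick sketch of the same kind of row loop written three ways, using the chat_data frame from the question (counting '#' characters is just a placeholder operation):

# Slow: iterrows materializes each row as a Series
for i, row in chat_data.iterrows():
    n = row["words"].count("#")

# Faster: itertuples yields lightweight namedtuples
for row in chat_data.itertuples(index=False):
    n = row.words.count("#")

# Faster still: apply the function to the column directly
hash_counts = chat_data["words"].apply(lambda s: s.count("#"))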
A plain Pandas DataFrame is stored as one block and is only processed by one CPU core. A Modin DataFrame, by contrast, is partitioned across rows and columns, and each partition can be sent to a different CPU core, up to the maximum number of cores in the system.
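If you want to try that route, Modin is designed as a drop-in replacement for pandas (assuming it is installed, e.g. with pip install "modin[ray]"), so swapping the import is usually the only change needed; a minimal sketch:

import modin.pandas as pd  # same API as pandas, but partitioned across all CPU cores

chat_data = pd.read_csv("chat.txt", sep="|", names=["user", "date", "words"])
print(chat_data.groupby("user").size())  # the groupby runs on the partitions in parallel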
You can split the comma-separated column to a list, explode the dataframe by that column of lists, groupby the name and the values from the exploded list, unstack or pivot_table the dataframe into your desired format, and do some final cleaning on the multi-index columns with droplevel(), reset_index(), etc.
All of the below are vectorized pandas methods, so hopefully it is quick. Note: the three columns are [0, 1, 2] in the code below because I read the sample from the clipboard and passed header=None.
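(For the real chat.txt you would get the same integer column labels by passing header=None to read_csv, for example:)

import pandas as pd

# header=None keeps the three fields labelled 0, 1, 2 instead of using the first row as a header
df = pd.read_csv("chat.txt", sep="|", header=None)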
Input:
df = pd.DataFrame({0: {0: 'john', 1: 'tom', 2: 'mary', 3: 'john', 4: 'mary'},
1: {0: '12-02-1999',
1: '12-02-1999',
2: '12-03-1999',
3: '12-02-2000',
4: '12-03-2000'},
2: {0: 'hello#,there#,how#,are#,you#,tom$ ',
1: 'hey#,john$,hows#, it#, goin#',
2: "hello#,boys#,fancy#,meetin#,ya'll#,here#",
3: 'well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$',
4: 'catch#,you#,on#,the#,flipside#,tom$,and#,john$'}})
Code:
# Strip the trailing '#' / '$' and split each row into a list of words
df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')

df = (df.explode(2)                      # one row per word
        .groupby([0, 2])[2].count()      # count each word per user
        .rename('Count')
        .reset_index()
        .set_index([0, 2])
        .unstack(1)                      # words become columns
        .fillna(0))
df.columns = df.columns.droplevel()      # drop the 'Count' level from the column MultiIndex
df = df.reset_index()
df
Out[1]:
2 0 goin it mary and are been boys catch catching ... on \
0 john 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 ... 0.0
1 mary 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 ... 1.0
2 tom 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
2     the  there  tom  tom   up  well  with  ya'll  you
0     0.0    1.0  0.0  1.0  1.0   1.0   1.0    0.0  2.0
1     1.0    0.0  1.0  0.0  0.0   0.0   0.0    1.0  1.0
2     0.0    0.0  0.0  0.0  0.0   0.0   0.0    0.0  0.0

[3 rows x 31 columns]
You could also use .pivot_table instead of .unstack(), which saves you this line of code: df.columns = df.columns.droplevel():
df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')

df = (df.explode(2)
        .groupby([0, 2])[2].count()
        .rename('Count')
        .reset_index()
        .pivot_table(index=0, columns=2, values='Count')
        .fillna(0)
        .astype(int)
        .reset_index())
df
Out[45]:
2 0 goin it mary and are been boys catch catching ... on \
0 john 0 0 1 1 1 1 0 0 1 ... 0
1 mary 0 0 0 1 0 0 1 1 0 ... 1
2 tom 1 1 0 0 0 0 0 0 0 ... 0
2 the there tom tom up well with ya'll you
0 0 1 0 1 1 1 1 0 2
1 1 0 1 0 0 0 0 1 1
2 0 0 0 0 0 0 0 0 0
[3 rows x 31 columns]