Cumulative Set in PANDAS

Tags: python, pandas

I have a dataframe of tweets and I'm looking to group the dataframe by date and generate a column that contains a cumulative list of all the unique users who have posted up to that date. None of the built-in functions (e.g., cumsum) appear to work for this. Here's a sample of the original tweet dataframe, where the index (created_at) is in datetime format:

In [3]: df
Out[3]: 
            screen_name 
created_at  
04-01-16    Bob 
04-01-16    Bob
04-01-16    Sally
04-01-16    Sally
04-02-16    Bob
04-02-16    Miguel
04-02-16    Tim

I can collapse the dataset by date and get a column with the unique users per day:

In [4]: df[['screen_name']].groupby(df.index.date).aggregate(lambda x: set(list(x)))

Out[4]:             screen_name
        2016-04-01  {Bob, Sally}
        2016-04-02  {Bob, Miguel, Tim}

So far so good. But what I'd like is to have a "cumulative set" like this:

                    Cumulative_list_up_to_this_date   Cumulative_number_of_unique_users
        2016-04-01  {Bob, Sally}                      2
        2016-04-02  {Bob, Sally, Miguel, Tim}         4

Ultimately, what I am really interested in is the cumulative number in the last column so I can plot it. I've considered looping over dates and other things but can't seem to find a good way. Thanks in advance for any help.

asked Sep 21 '16 17:09

Gregory Saxton

1 Answer

You cannot add sets with +, but you can add lists! So collect each day's users into a list, take the cumulative sum (which concatenates the lists day by day), and finally apply the set constructor to get rid of duplicates.

# Per-day lists of users -> running concatenation -> de-duplicate with set
cum_names = (df['screen_name'].groupby(df.index.date)
                              .agg(list)
                              .cumsum()
                              .apply(set))
# 2016-04-01                 {Bob, Sally}
# 2016-04-02    {Bob, Miguel, Tim, Sally}
# dtype: object

cum_count = cum_names.apply(len)
# 2016-04-01    2
# 2016-04-02    4
# dtype: int64
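
If concatenating ever-growing lists gets expensive on a large dataset, here is a rough sketch of two alternatives. It is not part of the approach above: it rebuilds the sample frame from the question, uses itertools.accumulate with set.union for the running sets, and, if only the count is needed, counts each user's first appearance with duplicated(). The names daily_sets, cum_sets, first_seen and cum_count_fast are made up for illustration, and the second variant assumes the rows are already in chronological order.

```python
from itertools import accumulate

import pandas as pd

# Sample data mirroring the question; the index holds the tweet dates.
df = pd.DataFrame(
    {"screen_name": ["Bob", "Bob", "Sally", "Sally", "Bob", "Miguel", "Tim"]},
    index=pd.to_datetime(["2016-04-01"] * 4 + ["2016-04-02"] * 3),
)
df.index.name = "created_at"

# Variant 1: one set of users per day, then a running union of those sets.
daily_sets = df["screen_name"].groupby(df.index.date).agg(set)
cum_sets = pd.Series(
    list(accumulate(daily_sets, set.union)),
    index=daily_sets.index,
)
cum_count = cum_sets.apply(len)

# Variant 2 (count only): flag the first tweet of each user, count the flags
# per day, and take the cumulative sum. Assumes df is sorted by date.
first_seen = ~df["screen_name"].duplicated()
cum_count_fast = first_seen.groupby(df.index.date).sum().cumsum()

print(cum_sets)
print(cum_count_fast)
```

Variant 2 never materialises the sets, so it should be the lighter option when the plot only needs the cumulative number of unique users.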

answered Oct 04 '22 17:10

A. Garcia-Raboso
