I have a dataframe where each row contains various meta-data pertaining to a single Reddit comment (e.g. author, subreddit, comment text).
I want to do the following: for each author, grab the list of all subreddits they have comments in, and transform this into a pandas dataframe where each row corresponds to an author and contains the list of unique subreddits they comment in.
I am currently trying some combination of the following, but can't quite get it to work:
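For concreteness, the result I'm after would look something like this (the 'subreddits' column name is just illustrative):

import pandas as pd

# desired shape: one row per author, with the list of unique subreddits they post in
wanted = pd.DataFrame({
    'author': ['a', 'b'],
    'subreddits': [['sr1', 'sr2'], ['sr2']],
})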
Attempt 1:
group = df['subreddit'].groupby(df['author']).unique()
list(group)
Attempt 2:
from collections import defaultdict

subreddit_dict = defaultdict(list)
for index, row in df.iterrows():
    author = row['author']
    subreddit = row['subreddit']
    subreddit_dict[author].append(subreddit)

for key, value in subreddit_dict.items():
    subreddit_dict[key] = set(value)

subreddit_df = pd.DataFrame.from_dict(subreddit_dict, orient='index')
Here are two strategies to do it. No doubt, there are other ways.
Assuming your dataframe looks something like this (obviously with more columns):
df = pd.DataFrame({'author': ['a', 'a', 'b'], 'subreddit': ['sr1', 'sr2', 'sr2']})

>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2
...
Solution 1: groupby
More straightforward than solution 2, and similar to your first attempt:
group = df.groupby('author')
df2 = group.apply(lambda x: x['subreddit'].unique())

# Alternatively, the same thing as a one-liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())
Result:
>>> df2
author
a    [sr1, sr2]
b         [sr2]
The author is the index, and the single column holds the list of all subreddits they are active in (this is how I interpreted the output you described).
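If you would rather have the author as a regular column (e.g. to merge on later), here is a minimal sketch, assuming df2 is the Series produced above; authors_df and 'subreddits' are just illustrative names, and df2 itself is left untouched so the next step still works:

# turn the author index into a column and name the values column
authors_df = df2.reset_index(name='subreddits')

# >>> authors_df
#   author  subreddits
# 0      a  [sr1, sr2]
# 1      b       [sr2]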
If you wanted each subreddit in a separate column instead, which might be more usable depending on what you plan to do with it, you could just do this afterwards:
df2 = df2.apply(pd.Series)
Result:
>>> df2
          0    1
author
a       sr1  sr2
b       sr2  NaN
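As a side note, apply(pd.Series) can be slow on large frames; an equivalent sketch that builds the wide frame directly from the underlying lists (assuming df2 is still the Series of arrays from the groupby step, i.e. this replaces the apply(pd.Series) line rather than following it; 'wide' is just an illustrative name):

# spread each author's subreddits across columns; shorter rows are padded with NaN
wide = pd.DataFrame([list(s) for s in df2], index=df2.index)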
Solution 2: Iterate through the dataframe
First, make a new dataframe with all the unique authors:
df2 = pd.DataFrame({'author':df.author.unique()})
And then just get the list of all unique subreddits they are active in, assigning it to a new column:
df2['subreddits'] = [
    list(set(df['subreddit'].loc[df['author'] == x['author']]))
    for _, x in df2.iterrows()
]
This gives you this:
>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]
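Either way, once you have an author column and a subreddits column, looking up a single author is straightforward (a usage sketch with the toy data above):

# subreddits for author 'a'
subs_a = df2.loc[df2['author'] == 'a', 'subreddits'].iloc[0]
print(subs_a)  # e.g. ['sr2', 'sr1'] -- order of a set is not guaranteed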