
Pandas, for each unique value in one column, get unique values in another column

I have a dataframe where each row contains various meta-data pertaining to a single Reddit comment (e.g. author, subreddit, comment text).

I want to do the following: for each author, I want to grab a list of all the subreddits they have comments in, and transform this data into a pandas dataframe where each row corresponds to an author, and a list of all the unique subreddits they comment in.

I am currently trying some combination of the following, but can't get it down:

Attempt 1:

    group = df['subreddit'].groupby(df['author']).unique()
    list(group)

Attempt 2:

    from collections import defaultdict

    subreddit_dict = defaultdict(list)

    for index, row in df.iterrows():
        author = row['author']
        subreddit = row['subreddit']
        subreddit_dict[author].append(subreddit)

    for key, value in subreddit_dict.items():
        subreddit_dict[key] = set(value)

    subreddit_df = pd.DataFrame.from_dict(subreddit_dict, orient='index')

Parseltongue asked Feb 25 '18 23:02



1 Answer

Here are two strategies to do it. No doubt, there are other ways.

Assuming your dataframe looks something like this (obviously with more columns):

    df = pd.DataFrame({'author': ['a', 'a', 'b'], 'subreddit': ['sr1', 'sr2', 'sr2']})

    >>> df
      author subreddit
    0      a       sr1
    1      a       sr2
    2      b       sr2
    ...

Solution 1: groupby

More straightforward than solution 2, and similar to your first attempt:

    group = df.groupby('author')

    df2 = group.apply(lambda x: x['subreddit'].unique())

    # Alternatively, same thing as a one liner:
    # df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())

Result:

    >>> df2
    author
    a    [sr1, sr2]
    b         [sr2]

The author is the index, and each value is the array of all subreddits that author is active in (this is how I interpreted the output you described).

If you wanted each subreddit in a separate column, which might be more usable depending on what you want to do with it, you could do this afterwards:

    df2 = df2.apply(pd.Series)

Result:

    >>> df2
              0    1
    author
    a       sr1  sr2
    b       sr2  NaN
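As a variation on Solution 1, here is a minimal sketch that selects the `subreddit` column before calling `unique()`, then converts the result into a regular dataframe with one row per author. The names `per_author`, `result`, and `wide` are only illustrative, and the sample data is the three-row frame from above, not your real data.

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Select the column first, then take the unique values within each group.
# This gives a Series indexed by author whose values are NumPy arrays.
per_author = df.groupby('author')['subreddit'].unique()

# One row per author, with the unique subreddits stored as a plain list.
result = per_author.apply(list).reset_index(name='subreddits')

# Optional wide layout: one column per subreddit position.
# Shorter rows are padded with missing values.
wide = pd.DataFrame([list(v) for v in per_author], index=per_author.index)
```

Here `unique()` keeps values in order of first appearance, so the lists follow the row order of the original dataframe.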

Solution 2: Iterate through dataframe

You can make a new dataframe with all unique authors:

    df2 = pd.DataFrame({'author': df.author.unique()})

And then just get the list of all unique subreddits they are active in, assigning it to a new column:

    df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']]))
                         for _, x in df2.iterrows()]

This gives you this:

    >>> df2
      author  subreddits
    0      a  [sr2, sr1]
    1      b       [sr2]

sacuL answered Oct 29 '22 02:10
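Because a `set` has no defined order, the order inside each list from Solution 2 can differ between runs (note `[sr2, sr1]` above). If a stable order matters, one possible sketch is to sort the unique values per author and skip `iterrows()` entirely. The name `result` is illustrative, and sorting alphabetically is an assumption about what ordering you would want.

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# For each author, de-duplicate the subreddits with a set and sort them
# so the list order is the same on every run.
result = (
    df.groupby('author')['subreddit']
      .apply(lambda s: sorted(set(s)))
      .reset_index(name='subreddits')
)
```

This yields the same two-column layout as Solution 2 (`author`, `subreddits`), just with a predictable order inside each list.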