Pandas groupby: How to get a union of strings

Tags: python, pandas

I have a dataframe like this:

       A         B       C
    0  1  0.749065    This
    1  2  0.301084      is
    2  3  0.463468       a
    3  4  0.643961  random
    4  1  0.866521  string
    5  2  0.120737       !

Calling

    In [10]: print(df.groupby("A")["B"].sum())

will return

    A
    1    1.615586
    2    0.421821
    3    0.463468
    4    0.643961

Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.

    A
    1    {This, string}
    2    {is, !}
    3    {a}
    4    {random}

I have been trying to find ways to do this.

Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although

df.groupby("A")["B"] 

is a

pandas.core.groupby.SeriesGroupBy object 

so I was hoping any Series method would work. Any ideas?

asked Jul 24 '13 17:07 by Anne



2 Answers

    In [4]: df = read_csv(StringIO(data),sep='\s+')

    In [5]: df
    Out[5]:
       A         B       C
    0  1  0.749065    This
    1  2  0.301084      is
    2  3  0.463468       a
    3  4  0.643961  random
    4  1  0.866521  string
    5  2  0.120737       !

    In [6]: df.dtypes
    Out[6]:
    A      int64
    B    float64
    C     object
    dtype: object

When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() directly to the groupby:

    In [8]: df.groupby('A').apply(lambda x: x.sum())
    Out[8]:
       A         B           C
    A
    1  2  1.615586  Thisstring
    2  4  0.421821         is!
    3  3  0.463468           a
    4  4  0.643961      random

By default, sum concatenates strings:

    In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
    Out[9]:
    A
    1    Thisstring
    2           is!
    3             a
    4        random
    dtype: object

You can do pretty much what you want

    In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
    Out[11]:
    A
    1    {This, string}
    2           {is, !}
    3               {a}
    4          {random}
    dtype: object
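One caveat with the join approach: str.join only accepts strings, so a group containing a missing value or a number would raise a TypeError. Here is a minimal sketch of one way around that, assuming you are happy to drop missing entries and stringify everything else. The helper name braces and the small frame are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical frame with a missing value and a non-string entry in column C
df = pd.DataFrame({
    "A": [1, 2, 1, 2],
    "C": ["This", "is", None, 7],
})

def braces(values):
    # Drop missing entries, then convert the rest to str so join() accepts them
    return "{%s}" % ", ".join(values.dropna().astype(str))

out = df.groupby("A")["C"].apply(braces)
print(out)
```

Whether dropping missing values is the right call depends on your data; you could just as well replace them with a placeholder before joining.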

To do this on the whole frame, one group at a time, the key is to return a Series:

    def f(x):
        return Series(dict(A = x['A'].sum(),
                           B = x['B'].sum(),
                           C = "{%s}" % ', '.join(x['C'])))

    In [14]: df.groupby('A').apply(f)
    Out[14]:
       A         B               C
    A
    1  2  1.615586  {This, string}
    2  4  0.421821         {is, !}
    3  3  0.463468             {a}
    4  4  0.643961        {random}
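As a possible alternative to returning a Series from apply, newer pandas versions (0.25 and later, as far as I know) support named aggregation in agg, which lets you pick a different function per column. This is only a sketch, not a replacement for the approach above: it rebuilds the example frame by hand and, unlike f, it does not sum column A within each group.

```python
import pandas as pd

# The example frame from the question, rebuilt by hand
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# One aggregation per output column: sum the floats, join the strings
result = df.groupby("A").agg(
    B=("B", "sum"),
    C=("C", lambda s: "{%s}" % ", ".join(s)),
)
print(result)
```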

answered Sep 21 '22 19:09 by Jeff


You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.

    >>> d
       A       B
    0  1    This
    1  2      is
    2  3       a
    3  4  random
    4  1  string
    5  2       !
    >>> d.groupby('A')['B'].apply(list)
    A
    1    [This, string]
    2           [is, !]
    3               [a]
    4          [random]
    dtype: object

If you want something else, just write a function that does what you want and then apply that.
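For instance, here is a small sketch of both ideas, using the column names from the question (A and C). The function sorted_union is a made-up example of "something else": it de-duplicates the strings in each group, sorts them, and joins them into one string.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# One set of strings per group
sets = df.groupby("A")["C"].apply(set)

# A custom function: unique strings per group, sorted and joined
def sorted_union(values):
    return ", ".join(sorted(set(values)))

joined = df.groupby("A")["C"].apply(sorted_union)

print(sets)
print(joined)
```

Note that sets are unordered, so the order in which the strings print inside each set is not guaranteed; use list or a sorted join if the order matters.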

answered Sep 22 '22 19:09 by BrenBarn