I have a dataframe like this:
A B C 0 1 0.749065 This 1 2 0.301084 is 2 3 0.463468 a 3 4 0.643961 random 4 1 0.866521 string 5 2 0.120737 !
Calling
In [10]: print df.groupby("A")["B"].sum()
will return
A 1 1.615586 2 0.421821 3 0.463468 4 0.643961
Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.
A 1 {This, string} 2 {is, !} 3 {a} 4 {random}
I have been trying to find ways to do this.
Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although
df.groupby("A")["B"]
is a
pandas.core.groupby.SeriesGroupBy object
so I was hoping any Series method would work. Any ideas?
The + operator lets you combine two or more strings in Python. This operator is referred to as the Python string concatenation operator. The + operator should appear between the two strings you want to merge. This code concatenates, or merges, the Python strings “Hello ” and “World”.
What is the GroupBy function? Pandas' GroupBy is a powerful and versatile function in Python. It allows you to split your data into separate groups to perform computations for better analysis.
You can group DataFrame rows into a list by using pandas. DataFrame. groupby() function on the column of interest, select the column you want as a list from group and then use Series. apply(list) to get the list for every group.
Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
In [4]: df = read_csv(StringIO(data),sep='\s+') In [5]: df Out[5]: A B C 0 1 0.749065 This 1 2 0.301084 is 2 3 0.463468 a 3 4 0.643961 random 4 1 0.866521 string 5 2 0.120737 ! In [6]: df.dtypes Out[6]: A int64 B float64 C object dtype: object
When you apply your own function, there is not automatic exclusions of non-numeric columns. This is slower, though, than the application of .sum()
to the groupby
In [8]: df.groupby('A').apply(lambda x: x.sum()) Out[8]: A B C A 1 2 1.615586 Thisstring 2 4 0.421821 is! 3 3 0.463468 a 4 4 0.643961 random
sum
by default concatenates
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum()) Out[9]: A 1 Thisstring 2 is! 3 a 4 random dtype: object
You can do pretty much what you want
In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x)) Out[11]: A 1 {This, string} 2 {is, !} 3 {a} 4 {random} dtype: object
Doing this on a whole frame, one group at a time. Key is to return a Series
def f(x): return Series(dict(A = x['A'].sum(), B = x['B'].sum(), C = "{%s}" % ', '.join(x['C']))) In [14]: df.groupby('A').apply(f) Out[14]: A B C A 1 2 1.615586 {This, string} 2 4 0.421821 {is, !} 3 3 0.463468 {a} 4 4 0.643961 {random}
You can use the apply
method to apply an arbitrary function to the grouped data. So if you want a set, apply set
. If you want a list, apply list
.
>>> d A B 0 1 This 1 2 is 2 3 a 3 4 random 4 1 string 5 2 ! >>> d.groupby('A')['B'].apply(list) A 1 [This, string] 2 [is, !] 3 [a] 4 [random] dtype: object
If you want something else, just write a function that does what you want and then apply
that.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With