I'm trying to build a new DataFrame by grouping on part of the string in one column.
import pandas
df = pandas.DataFrame([{'A': 'string_300_bla1', 'B': "Hi", 'C': 3},
                       {'A': 'string_300_blaa2', 'B': "Hello", 'C': 4},
                       {'A': 'string_487_blaaa1', 'B': "nice", 'C': 9},
                       {'A': 'string_487_blaaa2', 'B': "day", 'C': 6}])
I want to group by the leading part of the string in column A, i.e. the string_300 in
string_300_bla1
I tried:
import re
dfs = df['A'].str.contains('.*_\d+_.*', re.IGNORECASE).groupby(df['B'])
My output:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000279EFD009E8>
Desired output:
dfs = pandas.DataFrame([{'A': 'string_300', 'B': "Hi\n\nHello"},
                        {'A': 'string_487', 'B': "nice\n\nday"}])
A groupby works in three steps. Step 1: split the data into groups by creating a groupby object from the original DataFrame. Step 2: apply a function to each group, in this case an aggregation function that computes a summary statistic (you can also transform or filter your data in this step). Step 3: combine the results into a new DataFrame. A groupby object on its own, like the SeriesGroupBy printed above, is only the result of step 1.
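As a rough sketch of those three steps on the question's data (the names key, groups, summary and result, and the 'prefix' label, are made up here for illustration):

```python
import pandas

df = pandas.DataFrame([{'A': 'string_300_bla1', 'B': "Hi", 'C': 3},
                       {'A': 'string_300_blaa2', 'B': "Hello", 'C': 4},
                       {'A': 'string_487_blaaa1', 'B': "nice", 'C': 9},
                       {'A': 'string_487_blaaa2', 'B': "day", 'C': 6}])

# Grouping key: the "string_300"-style prefix of column A.
# 'prefix' is only an illustrative label for the key.
key = df['A'].str.extract(r'(\w+_\d+)')[0].rename('prefix')

# Step 1: split. This only builds the groupby object; nothing is computed yet.
groups = df.groupby(key)

# Step 2: apply an aggregation to each group.
summary = groups.agg({'B': '\n\n'.join, 'C': 'sum'})

# Step 3: combine. The group key becomes an ordinary column again.
result = summary.reset_index()
print(result)
```

Printing groups instead of result shows the same kind of "object at 0x..." line as in the question, because no aggregation has been applied at that point.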
groupby is a very powerful pandas method. You can group by one column and count the values of another column within each group using value_counts. For example, given a table of people and their activities, groupby together with value_counts counts the activities each person did.
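A small sketch of the groupby plus value_counts idea. The log table, its person and activity columns, and the values in it are hypothetical and only there to have something to count:

```python
import pandas

# Hypothetical activity log, invented for this example.
log = pandas.DataFrame({
    'person':   ['ann', 'ann', 'bob', 'ann', 'bob', 'bob'],
    'activity': ['run', 'swim', 'run', 'run', 'run', 'run'],
})

# For each person, count how often each activity occurs.
# The result is a Series indexed by (person, activity).
counts = log.groupby('person')['activity'].value_counts()
print(counts)

# Look up a single (person, activity) pair.
print(counts.loc[('ann', 'run')])
```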
Abstractly, grouping means providing a mapping from labels to group names. pandas objects can be split on any of their axes, and there are several ways to specify the split. The objects used for grouping are referred to here as keys. To group data by a single key, pass just that key as the argument to groupby.
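To make the "mapping of labels to group names" idea concrete, here is a sketch that rearranges the question's data so that column A is the index and column C holds the values. The names c, label_to_group, by_mapping and by_function are illustrative only:

```python
import pandas

# Column C of the question's data, indexed by column A (rearranged for illustration).
c = pandas.Series([3, 4, 9, 6],
                  index=['string_300_bla1', 'string_300_blaa2',
                         'string_487_blaaa1', 'string_487_blaaa2'])

# An explicit mapping: index label -> group name.
label_to_group = {label: label.rsplit('_', 1)[0] for label in c.index}
by_mapping = c.groupby(label_to_group).sum()

# The same mapping written as a function that is called on each index label.
by_function = c.groupby(lambda label: label.rsplit('_', 1)[0]).sum()

print(by_mapping)
print(by_function)
```

Both forms should give the same sums per group; passing a column name or an aligned Series as the key is just another way of supplying that mapping.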
pandas groupby groups data by category and applies a function to each category, which makes it an efficient way to aggregate data. DataFrame.groupby() splits the data into groups based on some criteria; pandas objects can be split on any of their axes.
For background, see "How to Speed Up Your Pandas Projects". With .apply(), pandas may effectively run a Python loop over each group. While the .groupby(...).apply() pattern is flexible, it can also prevent pandas from using its Cython-based optimizations.
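A minimal sketch of that trade-off on the question's data, comparing a Python function passed to .apply() with the built-in sum. The names key, via_apply and via_builtin are illustrative, and no timing is claimed here, only that both routes should agree:

```python
import pandas

df = pandas.DataFrame([{'A': 'string_300_bla1', 'B': "Hi", 'C': 3},
                       {'A': 'string_300_blaa2', 'B': "Hello", 'C': 4},
                       {'A': 'string_487_blaaa1', 'B': "nice", 'C': 9},
                       {'A': 'string_487_blaaa2', 'B': "day", 'C': 6}])

# Group key: everything before the last underscore in column A.
key = df['A'].str.rsplit('_', n=1).str[0]

# Flexible route: an arbitrary Python function is called once per group.
via_apply = df.groupby(key)['C'].apply(lambda s: s.sum())

# Built-in route: a named aggregation that pandas can dispatch internally.
via_builtin = df.groupby(key)['C'].sum()

print(via_apply.tolist(), via_builtin.tolist())
```

Joining strings with '\n\n'.join, as in the answers, is itself a plain Python callable, so for column B there is no built-in shortcut to prefer.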
One useful way to inspect a pandas GroupBy object and see the splitting in action is to iterate over it. This is implemented in DataFrameGroupBy.__iter__() and, for DataFrames, produces an iterator of (group, DataFrame) pairs.
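For instance, iterating over a groupby of the question's data might look like the sketch below (key, name, group and sizes are illustrative names):

```python
import pandas

df = pandas.DataFrame([{'A': 'string_300_bla1', 'B': "Hi", 'C': 3},
                       {'A': 'string_300_blaa2', 'B': "Hello", 'C': 4},
                       {'A': 'string_487_blaaa1', 'B': "nice", 'C': 9},
                       {'A': 'string_487_blaaa2', 'B': "day", 'C': 6}])

# Group key: everything before the last underscore in column A.
key = df['A'].str.rsplit('_', n=1).str[0]

sizes = {}
for name, group in df.groupby(key):
    # name is the group label, group is the sub-DataFrame for that label.
    print(name)
    print(group)
    sizes[name] = len(group)
```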
You can extract the prefix with a regex and group by it:
(df.groupby(df.A.str.extract(r'(\w+_\d+)')[0])
.agg({'B':'\n\n'.join, 'C':'sum'})
.reset_index()
)
Output:
            0            B   C
0  string_300  Hi\n\nHello   7
1  string_487  nice\n\nday  15
As pointed out by @CharlesGleason, here's a variant that groups on just the digits and keeps the first value of A in each group:
(df.groupby(df.A.str.extract(r'\w+_(\d+)')[0])
.agg({'A':'first', 'B':'\n\n'.join, 'C':'sum'})
.reset_index(drop=True)
)
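In case the [0] after str.extract looks odd: with its default settings str.extract returns a DataFrame with one column per capture group, so [0] picks the first (here the only) group. A small sketch, with a, extracted and digits as illustrative names:

```python
import pandas

a = pandas.Series(['string_300_bla1', 'string_300_blaa2',
                   'string_487_blaaa1', 'string_487_blaaa2'], name='A')

# One capture group -> a DataFrame with a single column labelled 0.
extracted = a.str.extract(r'\w+_(\d+)')
print(type(extracted).__name__, list(extracted.columns))

# Selecting column 0 gives the Series that is used as the grouping key.
digits = extracted[0]
print(digits.tolist())
```

The extracted digits are strings, not integers, which is fine for a grouping key.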
You can also use str.rsplit to drop everything after the last underscore and group on what is left:
df.B.groupby(df.A.str.rsplit('_',n=1).str[0]).agg('\n\n'.join).reset_index()
Out[236]:
            A            B
0  string_300  Hi\n\nHello
1  string_487  nice\n\nday