If I have a DataFrame where each row is an individual and each column an individual attribute, how can I get a new DataFrame which maps each individual to multiple results?
I've tried doing this with DataFrame.apply(), which seems the most intuitive, but it raises exceptions, as in the example below. Adding broadcast=False or reduce=False does not help.
Below is a trivial example, obviously, but consider any scenario where each row maps to multiple rows. What's the best way to handle that? In reality, each row can map to a different number of results. This is basically computing a one-to-many relationship.
Example: I have a DataFrame dataset with the following structure, and for each individual I want to get the three upcoming birthdays (trivial example, I know). So, from:
+---+-------+------------+
|   | name  | birthdate  |
+---+-------+------------+
| 1 | John | 1990-01-01 |
| 2 | Jane | 1957-04-03 |
| 3 | Max | 1987-02-03 |
| 4 | David | 1964-02-12 |
+---+-------+------------+
to something like:
+-------+------------+
| name  | birthday   |
+-------+------------+
| John | 2016-01-01 |
| John | 2017-01-01 |
| John | 2018-01-01 |
| Jane | 2016-04-03 |
| Jane | 2017-04-03 |
| Jane | 2018-04-03 |
| Max | 2016-02-03 |
| Max | 2017-02-03 |
| Max | 2018-02-03 |
| David | 2016-02-12 |
| David | 2017-02-12 |
| David | 2018-02-12 |
+-------+------------+
Intuitively, I would have attempted something like this:
def get_birthdays(person):
    birthdays = []
    for year in range(2016, 2019):
        birthdays.append({
            # person['name'] reads the column; person.name would be the row's index label
            'name': person['name'],
            'birthday': person.birthdate.replace(year=year)
        })
    return pd.DataFrame(birthdays)

# with data as my original DataFrame
data.apply(get_birthdays, axis=1)
However, this raises:
ValueError: could not broadcast input array from shape (3,2) into shape (3)
During handling of the above exception, another exception occurred:
[...]
ValueError: cannot copy sequence with size 2 to array axis with dimension 3
The groupby version of apply supports a DataFrame as the return value, in the way you intended:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Max', 'David'],
    'birthdate': [datetime(1990, 1, 1), datetime(1957, 4, 3), datetime(1987, 2, 3), datetime(1964, 2, 12)],
})

def get_birthdays(df_x):
    # df_x is the sub-DataFrame for one group (here: one person)
    d = {'name': [], 'birthday': []}
    name = df_x.iloc[0]['name']
    original = df_x.iloc[0]['birthdate']
    for year in range(2016, 2019):
        d['name'].append(name)
        d['birthday'].append(original.replace(year=year))
    return pd.DataFrame(d)

# The per-group DataFrames are concatenated; reset_index gives a clean 0..n-1 index
print(df.groupby('name', group_keys=False).apply(get_birthdays).reset_index(drop=True))
Output:
birthday name
0 2016-02-12 David
1 2017-02-12 David
2 2018-02-12 David
3 2016-04-03 Jane
4 2017-04-03 Jane
5 2018-04-03 Jane
6 2016-01-01 John
7 2017-01-01 John
8 2018-01-01 John
9 2016-02-03 Max
10 2017-02-03 Max
11 2018-02-03 Max
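If the rows do not need to be grouped first, another way to express the same one-to-many mapping is to build a list of results per row and then expand it with DataFrame.explode (available in pandas 0.25 and later). This is only a sketch of that alternative, not part of the answer above: the helper name upcoming_birthdays and its years parameter are made up for illustration, and the list it returns could have a different length for each row.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Max", "David"],
    "birthdate": pd.to_datetime(
        ["1990-01-01", "1957-04-03", "1987-02-03", "1964-02-12"]
    ),
})

def upcoming_birthdays(birthdate, years=range(2016, 2019)):
    """Hypothetical helper: one birthday per year; the list length may vary per row."""
    return [birthdate.replace(year=year) for year in years]

result = (
    df.assign(birthday=df["birthdate"].map(upcoming_birthdays))  # one list per row
      .explode("birthday")                                       # one row per list element
      .drop(columns="birthdate")
      .reset_index(drop=True)
)
print(result)
```

Unlike the groupby version, this keeps the individuals in their original row order, and it does not rely on the name column being unique.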