I have a big pandas Dataframe with fictional persondata. The below is a small example - each person is defined by a number.
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Number':["5569", "3385", "9832", "6457", "5346", "5462", "9873", "2366"] , 'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female'], 'Children': [np.nan, "5569 6457", "5569", np.nan, "6457", "2366", "2366", np.nan]})
df
Number Gender Children
0 5569 Male NaN
1 3385 Male 5569 6457
2 9832 Female 5569
3 6457 Male NaN
4 5346 Female 6457
5 5462 Female 2366
6 9873 Male 2366
7 2366 Female NaN
Some of the people are the children of some of the others. Now I want to make two columns "Mother" and "Father" and fill them with the relevant numbers. I would get those by looking at the "Children" column and then adding someone as the father if they are a male and has the number of the child in "Children" and the same for females as mothers. However, some of the values are NaN and some people have multiple children (they can have more than 4 in the actual dataset).
I've been trying with .isin and similar but I simply can't get it to work.
They expected output for this example would look like this:
df = pd.DataFrame({ 'Number':["5569", "3385", "9832", "6457", "5346", "5462", "9873", "2366"] , 'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female'], 'Children': [np.nan, "5569 6457", "5569", np.nan, "6457", "2366", "2366", np.nan], 'Mother':[9832, np.nan, np.nan,"5346", np.nan, np.nan, np.nan, "5462"], 'Father':["3385", np.nan, np.nan, "3385", np.nan, np.nan, np.nan, "9873"]})
df
Number Gender Children Mother Father
0 5569 Male NaN 9832 3385
1 3385 Male 5569 6457 NaN NaN
2 9832 Female 5569 NaN NaN
3 6457 Male NaN 5346 3385
4 5346 Female 6457 NaN NaN
5 5462 Female 2366 NaN NaN
6 9873 Male 2366 NaN NaN
7 2366 Female NaN 5462 9873
Use
df = df.join(df.assign(Children=df['Children'].str.split(' '))
.explode('Children')
.assign(Children = lambda x: pd.to_numeric(x['Children'],
errors = 'coerce'))
.pivot_table(columns='Gender',
index ='Children',
values = 'Number',
fill_value=0)
.rename(columns = {'Female':'Mother','Male':'Father'}),
on = 'Number')
print(df)
Number Gender Children Mother Father
0 5569 Male NaN 9832.0 3385.0
1 3385 Male 5569 6457 NaN NaN
2 9832 Female 5569 NaN NaN
3 6457 Male NaN 5346.0 3385.0
4 5346 Female 6457 NaN NaN
5 5462 Female 2366 NaN NaN
6 9873 Male 2366 NaN NaN
7 2366 Female NaN 5462.0 9873.0
note that here the number of spaces between the values in each cell of the children column is very important due to the use of Series.str.split
This looks good for me (Only 2 lines :D )
Note: the string with the space => I ignored the space and made a large number
df['MotherNumber'] = np.where(pd.notna(df['Children'].str.strip()) & (df['Gender'] == 'Female'), float('nan'), df['Mother'])
df['FatherNumber'] = np.where(pd.notna(df['Children'].str.strip()) & (df['Gender'] == 'Male'), float('nan'), df['Father'])
print(df)
Number Gender Children Mother Father MotherNumber FatherNumber
0 5569 Male NaN 9832 3385 9832 3385
1 3385 Male 5569 6457 NaN NaN NaN NaN
2 9832 Female 5569 NaN NaN NaN NaN
3 6457 Male NaN 5346 3385 5346 3385
4 5346 Female 6457 NaN NaN NaN NaN
5 5462 Female 2366 NaN NaN NaN NaN
6 9873 Male 2366 NaN NaN NaN NaN
7 2366 Female NaN 5462 9873 5462 9873
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With