I have a question about flattening or collapsing a dataframe from several columns in one row with information about a key to several rows each with the same key column and the appropriate data. Suppose a dataframe is something like this:
df = pd.DataFrame({'CODE': ['AA', 'BB', 'CC'],
'START_1': ['1990-01-01', '2000-01-01', '2005-01-01'],
'END_1': ['1990-02-14', '2000-03-01', '2005-12-31'],
'MEANING_1': ['SOMETHING', 'OR', 'OTHER'],
'START_2': ['1990-02-15', None, '2006-01-01'],
'END_2': ['1990-06-14', None, '2006-12-31'],
'MEANING_2': ['ELSE', None, 'ANOTHER']})
CODE START_1 END_1 MEANING_1 START_2 END_2 MEANING_2
0 AA 1990-01-01 1990-02-14 SOMETHING 1990-02-15 1990-06-14 ELSE
1 BB 2000-01-01 2000-03-01 OR None None None
2 CC 2005-01-01 2005-12-31 OTHER 2006-01-01 2006-12-31 ANOTHER
and I need to get it into a form somewhat like this:
CODE START END MEANING
0 AA 1990-01-01 1990-02-14 SOMETHING
1 AA 1990-02-15 1990-06-14 ELSE
2 BB 2000-01-01 2000-03-01 OR
3 CC 2005-01-01 2005-12-31 OTHER
4 CC 2006-01-01 2006-12-31 ANOTHER
I have a solution as follows:
df_a = df[['CODE', 'START_1', 'END_1', 'MEANING_1']]
df_b = df[['CODE', 'START_2', 'END_2', 'MEANING_2']]
df_a = df_a.rename(index=str, columns={'CODE': 'CODE',
'START_1': 'START',
'END_1': 'END',
'MEANING_1': 'MEANING'})
df_b = df_b.rename(index=str, columns={'CODE': 'CODE',
'START_2': 'START',
'END_2': 'END',
'MEANING_2': 'MEANING'})
df = pd.concat([df_a, df_b], ignore_index=True)
df = df.dropna(axis=0, how='any')
Which yields the desired result. Of course this doesn't seem very pythonic and is clearly not ideal if you have more than 2 column groups which need to be collapsed (I actually have 6 in my real code). I've examined the groupby()
, melt()
and stack()
methods but haven't really found them to be very useful yet. Any suggestions would be appreciated.
Use pd.wide_to_long
:
pd.wide_to_long(df, stubnames=['END', 'MEANING', 'START'],
i='CODE', j='Number', sep='_', suffix='*')
Output:
END MEANING START
CODE Number
AA 1 1990-02-14 SOMETHING 1990-01-01
BB 1 2000-03-01 OR 2000-01-01
CC 1 2005-12-31 OTHER 2005-01-01
AA 2 1990-06-14 ELSE 1990-02-15
BB 2 None None None
CC 2 2006-12-31 ANOTHER 2006-01-01
Then, we can drop Number column/index and dropna's if you wish, e.g. df.reset_index().drop('Number', 1)
.
This is what melt
will achieve this
df1=df.melt('CODE')
df1[['New','New2']]=df1.variable.str.split('_',expand=True)
df1.set_index(['CODE','New2','New']).value.unstack()
Out[492]:
New END MEANING START
CODE New2
AA 1 1990-02-14 SOMETHING 1990-01-01
2 1990-06-14 ELSE 1990-02-15
BB 1 2000-03-01 OR 2000-01-01
2 None None None
CC 1 2005-12-31 OTHER 2005-01-01
2 2006-12-31 ANOTHER 2006-01-01
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With