Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to flatten/concatenate multiple columns with similar information based on one index column in pandas?

Tags:

python

pandas

I have a question about flattening or collapsing a dataframe from several columns in one row with information about a key to several rows each with the same key column and the appropriate data. Suppose a dataframe is something like this:

df = pd.DataFrame({'CODE': ['AA', 'BB', 'CC'],
              'START_1': ['1990-01-01', '2000-01-01', '2005-01-01'],
              'END_1': ['1990-02-14', '2000-03-01', '2005-12-31'],
              'MEANING_1': ['SOMETHING', 'OR', 'OTHER'],
              'START_2': ['1990-02-15', None, '2006-01-01'],
              'END_2': ['1990-06-14', None, '2006-12-31'],
              'MEANING_2': ['ELSE', None, 'ANOTHER']})
  CODE     START_1       END_1  MEANING_1     START_2       END_2 MEANING_2
0   AA  1990-01-01  1990-02-14  SOMETHING  1990-02-15  1990-06-14      ELSE
1   BB  2000-01-01  2000-03-01         OR        None        None      None
2   CC  2005-01-01  2005-12-31      OTHER  2006-01-01  2006-12-31   ANOTHER

and I need to get it into a form somewhat like this:

  CODE       START         END    MEANING
0   AA  1990-01-01  1990-02-14  SOMETHING
1   AA  1990-02-15  1990-06-14       ELSE
2   BB  2000-01-01  2000-03-01         OR
3   CC  2005-01-01  2005-12-31      OTHER
4   CC  2006-01-01  2006-12-31    ANOTHER

I have a solution as follows:

df_a = df[['CODE', 'START_1', 'END_1', 'MEANING_1']]
df_b = df[['CODE', 'START_2', 'END_2', 'MEANING_2']]
df_a = df_a.rename(index=str, columns={'CODE': 'CODE',
                                'START_1': 'START',
                                'END_1': 'END',
                                'MEANING_1': 'MEANING'})
df_b = df_b.rename(index=str, columns={'CODE': 'CODE',
                                'START_2': 'START',
                                'END_2': 'END',
                                'MEANING_2': 'MEANING'})
df = pd.concat([df_a, df_b], ignore_index=True)
df = df.dropna(axis=0, how='any')

Which yields the desired result. Of course this doesn't seem very pythonic and is clearly not ideal if you have more than 2 column groups which need to be collapsed (I actually have 6 in my real code). I've examined the groupby(), melt() and stack() methods but haven't really found them to be very useful yet. Any suggestions would be appreciated.

like image 272
DrPiranoid Avatar asked Mar 07 '23 23:03

DrPiranoid


2 Answers

Use pd.wide_to_long:

pd.wide_to_long(df, stubnames=['END', 'MEANING', 'START'],
                i='CODE', j='Number', sep='_', suffix='*')

Output:

                    END    MEANING       START
CODE Number                                   
AA   1       1990-02-14  SOMETHING  1990-01-01
BB   1       2000-03-01         OR  2000-01-01
CC   1       2005-12-31      OTHER  2005-01-01
AA   2       1990-06-14       ELSE  1990-02-15
BB   2             None       None        None
CC   2       2006-12-31    ANOTHER  2006-01-01

Then, we can drop Number column/index and dropna's if you wish, e.g. df.reset_index().drop('Number', 1).

like image 106
Scott Boston Avatar answered Mar 09 '23 14:03

Scott Boston


This is what melt will achieve this

df1=df.melt('CODE')

df1[['New','New2']]=df1.variable.str.split('_',expand=True)
df1.set_index(['CODE','New2','New']).value.unstack()
Out[492]: 
New               END    MEANING       START
CODE New2                                   
AA   1     1990-02-14  SOMETHING  1990-01-01
     2     1990-06-14       ELSE  1990-02-15
BB   1     2000-03-01         OR  2000-01-01
     2           None       None        None
CC   1     2005-12-31      OTHER  2005-01-01
     2     2006-12-31    ANOTHER  2006-01-01
like image 44
BENY Avatar answered Mar 09 '23 14:03

BENY