Merge pandas DataFrame columns starting with the same letters

Let's say I have a DataFrame:

>>> df = pd.DataFrame({'a1':[1,2],'a2':[3,4],'b1':[5,6],'b2':[7,8],'c':[9,0]})
>>> df
   a1  a2  b1  b2  c
0   1   3   5   7  9
1   2   4   6   8  0
>>> 

I want to concatenate (rather than merge) the columns whose names start with the same letter, such as a1 and a2. The c column has no partner, so it ends up shorter than the others; instead of raising an error, it should be padded with NaNs.

In other words, I want to reshape a wide DataFrame into a long one.

I already have a solution, but it is inefficient: it uses a for loop with a try/except. I would like a faster and cleaner one. My current code:

>>> import numpy as np
>>> df2 = pd.DataFrame()
>>> for i in df.columns.str[:1].unique():
...     # flatten, row by row, all columns whose name starts with i
...     vals = df[[x for x in df.columns if x[:1] == i]].values.flatten()
...     try:
...         df2[i] = vals
...     except ValueError:  # fewer values than rows in df2: pad with NaN
...         df2[i] = vals.tolist() + [np.nan] * (len(df2) - len(vals))
...


>>> df2
   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
>>> 

I would like to obtain the same results with better code.

U12-Forward asked Jun 07 '19 03:06


2 Answers

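I'd recommend melt followed by pivot. After melting, each letter appears several times, so pivot needs a row key that is unique within each letter; a cumcount column provides one.

u = df.melt()
u['variable'] = u['variable'].str[0]  # extract the first letter
# pivot accepts keyword arguments only in pandas 2.0 and later
u.assign(count=u.groupby('variable').cumcount()).pivot(index='count', columns='variable', values='value')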

variable    a    b    c
count                  
0         1.0  5.0  9.0
1         2.0  6.0  0.0
2         3.0  7.0  NaN
3         4.0  8.0  NaN

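This can be rewritten as:

u = df.melt()
u['variable'] = [x[0] for x in u['variable']]
u.insert(0, 'count', u.groupby('variable').cumcount())

# u's columns are now ['count', 'variable', 'value'], i.e. pivot's index, columns, values.
# On pandas < 2.0 this could be shortened to u.pivot(*u); newer versions need keywords.
u.pivot(index='count', columns='variable', values='value')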

variable    a    b    c
count                  
0         1.0  5.0  9.0
1         2.0  6.0  0.0
2         3.0  7.0  NaN
3         4.0  8.0  NaN
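
To make the role of the cumcount column easier to see, here is a self-contained sketch of the same melt/pivot steps, one per statement. The names `long_df` and `result` are only illustrative, and keyword arguments are used for `pivot` on the assumption that a recent pandas is installed.

```python
import pandas as pd

df = pd.DataFrame({'a1': [1, 2], 'a2': [3, 4],
                   'b1': [5, 6], 'b2': [7, 8], 'c': [9, 0]})

# Step 1: wide -> long. melt() yields two columns, 'variable' and 'value'.
long_df = df.melt()

# Step 2: keep only the first letter of each original column name.
long_df['variable'] = long_df['variable'].str[0]

# Step 3: number the rows within each letter (0, 1, 2, ...), so that every
# (count, variable) pair is unique, which pivot requires.
long_df['count'] = long_df.groupby('variable').cumcount()

# Step 4: long -> wide, one column per letter; missing cells become NaN.
result = long_df.pivot(index='count', columns='variable', values='value')
print(result)
```

Note that melt stacks the original columns one after another, so column a comes out as 1, 2, 3, 4 rather than the row-by-row order 1, 3, 2, 4 shown in the question.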

If performance matters, here's an alternative with pd.concat:

from operator import itemgetter

# note: groupby(..., axis=1) is deprecated since pandas 2.1
pd.concat({
    k: pd.Series(g.values.ravel())
    for k, g in df.groupby(itemgetter(0), axis=1)
}, axis=1)

   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
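
Since column-wise groupby (axis=1) is deprecated in newer pandas releases, one possible variant is to group the rows of the transposed frame instead. This is a sketch of that idea, not the answer's original code; each group is transposed back before flattening so that the row-by-row order is preserved.

```python
import pandas as pd

df = pd.DataFrame({'a1': [1, 2], 'a2': [3, 4],
                   'b1': [5, 6], 'b2': [7, 8], 'c': [9, 0]})

# df.T has the original column names as its index, so a plain row-wise
# groupby on the first character of each label replaces groupby(axis=1).
result = pd.concat(
    {
        # group.T restores the original orientation before flattening
        key: pd.Series(group.T.to_numpy().ravel())
        for key, group in df.T.groupby(lambda name: name[0])
    },
    axis=1,  # line the Series up side by side; shorter ones are padded with NaN
)
print(result)
```

The transpose copies the data when the columns have mixed dtypes, so this may be slower than the axis=1 form on very wide frames.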
cs95 answered Sep 17 '22 14:09


Use a dictionary comprehension:

df = pd.DataFrame({i: pd.Series(x.to_numpy().ravel())
                   for i, x in df.groupby(lambda x: x[0], axis=1)})
print(df)
   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
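
Because the question is about speed, a small harness like the following can be used to check that a comprehension-based version matches the original loop and to time both on your own data. It is only a sketch: the function names are hypothetical, the comprehension groups the transposed frame (an assumption made so that it runs without axis=1), and no timings are quoted because they depend on the pandas version and the data size.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a1': [1, 2], 'a2': [3, 4],
                   'b1': [5, 6], 'b2': [7, 8], 'c': [9, 0]})


def loop_version(frame):
    """The question's loop: assign each flattened group, padding short ones with NaN."""
    out = pd.DataFrame()
    for prefix in frame.columns.str[0].unique():
        flat = frame[[c for c in frame.columns if c[0] == prefix]].to_numpy().ravel()
        try:
            out[prefix] = flat
        except ValueError:  # group is shorter than the frame built so far
            out[prefix] = flat.tolist() + [np.nan] * (len(out) - len(flat))
    return out


def comprehension_version(frame):
    """Dictionary comprehension over groups of the transposed frame."""
    return pd.DataFrame({
        key: pd.Series(group.T.to_numpy().ravel())
        for key, group in frame.T.groupby(lambda name: name[0])
    })


# Check that both versions agree (NaN positions included) before timing them.
slow, fast = loop_version(df), comprehension_version(df)
same = (list(slow.columns) == list(fast.columns)
        and np.array_equal(slow.to_numpy(), fast.to_numpy(), equal_nan=True))
print('same result:', same)

for func in (loop_version, comprehension_version):
    seconds = timeit.timeit(lambda: func(df), number=100)
    print(f'{func.__name__}: {seconds:.4f} s for 100 runs')
```

On a frame this small the difference is unlikely to be meaningful; a wider or longer frame gives a more useful comparison.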
jezrael answered Sep 18 '22 14:09