
"Expand" pandas dataframe by values in column

Let's say I start with a dataframe that has some data and a column of quantities:

In:  df=pd.DataFrame({'first-name':['Jan','Leilani'],'Qty':[2,4]})

Out: Qty    first-name
     2      Jan
     4      Leilani

I want to create a dataframe that repeats each row as many times as its Qty value and numbers the copies in a new position column. Here is what the output should look like:

Qty     first-name  position
2       Jan         1
2       Jan         2
4       Leilani     1
4       Leilani     2
4       Leilani     3
4       Leilani     4

I can do this using python like so:

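rows = []

for idx in df.index:
    # emit one copy of the row per unit of Qty, numbered from 1
    for position in range(1, df.loc[idx, 'Qty'] + 1):
        row = df.loc[idx].copy()  # copy so the assignment never touches df
        row['position'] = position
        rows.append(row)

outDf = pd.DataFrame(rows)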

This is very slow. Is there a way to do this using pandas functions? This is effectively an "unpivot", which in pandas is "melt", but I wasn't able to figure out how to use the melt function to accomplish this.

Thanks,

Maile Cupo asked May 09 '18 15:05


2 Answers

With repeat and cumcount

Newdf=df.reindex(df.index.repeat(df.Qty))
Newdf['position']=Newdf.groupby(level=0).cumcount()+1
Newdf
Out[931]: 
   Qty first-name position
0    2        Jan        1
0    2        Jan        2
1    4    Leilani        1
1    4    Leilani        2
1    4    Leilani        3
1    4    Leilani        4
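
To see what each step contributes, here is a minimal sketch that rebuilds the question's dataframe and keeps the intermediate results. The names repeated_index and expanded are only illustrative and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'first-name': ['Jan', 'Leilani'], 'Qty': [2, 4]})

# Index.repeat repeats each index label as many times as that row's Qty
repeated_index = df.index.repeat(df['Qty'])

# reindex returns the matching row once per repeated label
expanded = df.reindex(repeated_index)

# Copies of the same original row share an index label, so grouping on the
# index level and counting within each group numbers the copies 0, 1, 2, ...
expanded['position'] = expanded.groupby(level=0).cumcount() + 1

print(repeated_index.tolist())        # [0, 0, 1, 1, 1, 1]
print(expanded['position'].tolist())  # [1, 2, 1, 2, 3, 4]
```

This relies on the original index being unique. If it is not, calling df.reset_index(drop=True) first is one way to get there.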
BENY answered Sep 19 '22 21:09


This uses almost the same concepts as Wen's answer (the repeat and cumcount one).

The differences are:

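  1. loc instead of reindex (same result here, since every repeated label exists in the index)
  2. assign instead of = assignment (assign returns a new dataframe and leaves the original untouched)
  3. Pass a lambda to assign to embed the groupby logic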

df.loc[df.index.repeat(df.Qty)].assign(
    # group on the index level so rows that share a first-name are still numbered separately
    position=lambda d: d.groupby(level=0).cumcount() + 1
)

   Qty first-name  position
0    2        Jan         1
0    2        Jan         2
1    4    Leilani         1
1    4    Leilani         2
1    4    Leilani         3
1    4    Leilani         4
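
One thing to watch with the groupby key: grouping by the first-name column and grouping by the index level give the same positions only while every row has a distinct name. A minimal sketch with made-up data (the dup frame below is hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data: two separate rows that happen to share a first name
dup = pd.DataFrame({'first-name': ['Jan', 'Jan'], 'Qty': [2, 3]})
expanded = dup.loc[dup.index.repeat(dup['Qty'])]

# Grouping by name treats both original rows as one group
by_name = expanded.groupby('first-name').cumcount() + 1

# Grouping by the index level restarts the count for each original row
by_row = expanded.groupby(level=0).cumcount() + 1

print(by_name.tolist())  # [1, 2, 3, 4, 5]
print(by_row.tolist())   # [1, 2, 1, 2, 3]
```

If positions should restart for every source row, the index-level grouping is the safer choice.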

Construct with np.arange

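import numpy as np

q = df.Qty.values
# running counter over all output rows, minus the start offset of each row's block, plus 1
r = np.arange(q.sum()) - np.append(0, q[:-1]).cumsum().repeat(q) + 1
df.loc[df.index.repeat(q)].assign(position=r)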

   Qty first-name  position
0    2        Jan         1
0    2        Jan         2
1    4    Leilani         1
1    4    Leilani         2
1    4    Leilani         3
1    4    Leilani         4
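
The arithmetic in r is easier to follow when split into steps. This is only a breakdown of the one-liner above, using the question's quantities. The names running, starts and offsets are illustrative.

```python
import numpy as np

q = np.array([2, 4])                    # the Qty column from the question

running = np.arange(q.sum())            # [0 1 2 3 4 5], one counter over all output rows
starts = np.append(0, q[:-1]).cumsum()  # [0 2], where each original row's block begins
offsets = starts.repeat(q)              # [0 0 2 2 2 2], block start aligned to every output row
position = running - offsets + 1        # [1 2 1 2 3 4]

print(position.tolist())
```

Subtracting each block's start from the running counter resets the count to zero at the first copy of every row, and the + 1 makes it 1-based.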
piRSquared answered Sep 22 '22 21:09