Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python - Pandas - Dataframe: Row Specific Conditional Column Offset

I am trying to do a dataframe transformation that I cannot solve. I have tried multiple approaches from stackoverflow and the pandas documentation: apply, apply(lambda: ...), pivots, and joins. Too many attempts to list here, but not sure which approach is the best or if maybe I tried the right approach with the wrong syntax.

Basically, I have a dataframe, and I need to 1) offset the columns, 2) the number of columns to offset by varies and depends on a variable in the dataframe, 3) create columns at the end of the dataframe where needed to accommodate the offset, and 4) place zeros in the newly created intervals.

df1 = pd.DataFrame({'first' : ['John', 'Mary', 'Larry', 'jerry'], '1' : [5.5, 6.0,10,20], '2' : [100, 200, 300, 400], '3' : [150, 100, 240, 110], 'offset' : ([1,0,2,1])})
goal_df = pd.DataFrame({'first' : ['John', 'Mary', 'Larry', 'jerry'], '1' : [0.0, 6.0, 0.0, 0], '2' : [5.5, 200, 0.0, 20], '3' : [100, 100, 10, 400], '4' : [150, 0.0, 300, 110], '5' : [0.0, 0.0, 240, 0.0]})

df1
1         2        3    first      offset
5.5      100      150    John       1
6.0      200      100    Mary       0
10.0     300      240    Larry      2
20.0     400      110    jerry      1


goal_df
1      2    3    4    5  first
0    5.5  100  150    0   John
6  200.0  100    0    0   Mary
0    0.0   10  300  240  Larry
0   20.0  400  110    0  jerry

This data set will have c. 500 rows and c. 120 columns. The amount of the offset will very between 0-12. I thought about doing this with base Python functions, but I also found that difficult and the time consumer by the program would defeat the ultimate purpose which is to remove some tasks being done in Microsoft Excel.

I complain a lot about how Excel is inferior for big tasks like this, but it seems so far that the current spreadsheet offset() function in excel does do this in a very easy to use way but with thousands of formulas, is very slow. I have sold my workplace on the benefits of Python over Excel, and this is my first real trial, so speed is very important to me because I'm trying to convince my colleagues that Python can gobble up this spreadsheet much quicker than the current excel file weighing in a 96Mb in file size.

I came pretty close with the melt() function, and then taking the former column numbers and added the offset to them. However, I've had a lot of problems trying to reform the dataframe using pivot. No luck with apply or apply(lambda)!

Thanks for any help anyone can give!

like image 544
Raymond Carl Avatar asked Oct 20 '22 07:10

Raymond Carl


1 Answers

This is not especially elegant or concise but ought to do the trick. I find it a little easier to shuffle columns around in numpy (also should be a bit faster) so I first convert from a dataframe to array.

arr    = df1.values[:,:-2]    # just the numbers
offset = df1.values[:,-1]     # just the offsets
column_pad = 2
arr2 = np.zeros( (arr.shape[0],arr.shape[1]+column_pad) )

Here's the key code, which simply shifts each row by the amount of the offset.

for i, j in enumerate(offset):
    arr2[i,j:3+j] = arr[i]

array([[   0. ,    5.5,  100. ,  150. ,    0. ],
       [   6. ,  200. ,  100. ,    0. ,    0. ],
       [   0. ,    0. ,   10. ,  300. ,  240. ],
       [   0. ,   20. ,  400. ,  110. ,    0. ]])

Beyond that it's just a little bit of manual labor to add space for the columns and put them in the right order.

df2 = df1.copy()
last_column = 6
for i in map(str,range(3,last_column)):
    df2[i] = 0
df2 = df2[ map(str,range(1,last_column))+['first','offset']]

Then load arr2 into df2.

df2.loc[:,'1':'5'] = arr2

   1      2    3    4    5  first  offset
0  0    5.5  100  150    0   John       1
1  6  200.0  100    0    0   Mary       0
2  0    0.0   10  300  240  Larry       2
3  0   20.0  400  110    0  jerry       1
like image 89
JohnE Avatar answered Oct 22 '22 00:10

JohnE