I have a pandas dataframe which contains 3 columns, each containing a site that a user has visited during a session.
In some cases, a user may have not visited 3 sites in a single session. This is shown by a 0, denoting that no site has been visited.
import pandas as pd
df = pd.DataFrame(data=[[5, 8, 1],[8,0,0],[1,17,0]],
columns=['site1', 'site2', 'site3'])
print(df)
site1 site2 site3
0 5 8 1
1 8 0 0
2 1 17 0
In the example above, user 0 has visited sites 5, 8 and 1. User 1 has visited site 8 only, and user 2 has visited sites 1 and 17.
I would like to create a new column, last_site
, which shows the last site visited by the user in that session.
The result I want is this:
site1 site2 site3 last_site
0 5 8 1 1
1 8 0 0 8
2 1 17 0 17
How can I do this in a concise way using pandas?
Code:
df['last_site'] = df.apply(lambda x: x.iloc[x.nonzero()].iloc[-1], axis=1)
Output:
site1 site2 site3 last_site
0 5 8 1 1
1 8 0 0 8
2 1 17 0 17
mask
+ ffill
A "pure Pandas" solution:
df['last'] = df.mask(df.eq(0)).ffill(1).iloc[:, -1].astype(int)
numba
For efficiency over a large number of rows / columns, numba
can help. To see why this works better than argmax
, see Efficiently return the index of the first value satisfying condition in array.
from numba import njit
@njit
def get_last_val(A):
m, n = A.shape
res = A[:, -1]
for i in range(m):
for j in range(n):
if A[i, j] == 0:
res[i] = A[i, max(0, j-1)]
break
return res
df['last'] = get_last_val(df.values)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With