
Python. Extract last digit of a string from a Pandas column

I want to store the last digit of each 'UserId' in a new column (UserId is of type string).

I came up with this, but the DataFrame is long and it takes forever. Any tips on how to optimize this or avoid the for loop?

df['LastDigit'] = np.nan
for i in range(0,len(df['UserId'])):
    df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
prp asked Oct 17 '18 08:10


2 Answers

Use str.strip, then take the last character by indexing with str[-1]:

df['LastDigit'] = df['UserId'].str.strip().str[-1]
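
A minimal sketch of how this vectorized form behaves on a small frame. The sample values are made up for illustration. Note that a missing UserId simply stays missing in the result instead of raising an error:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: padded strings plus one missing value
df = pd.DataFrame({'UserId': ['1234 ', ' 5678', '90', np.nan]})

# .str.strip() removes surrounding whitespace, .str[-1] takes the last character;
# missing values are propagated rather than causing an exception
df['LastDigit'] = df['UserId'].str.strip().str[-1]

print(df)
```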

If performance is important and there are no missing values, use a list comprehension:

df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
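
The comprehension calls x.strip()[-1] on every element, so it fails on NaN (a float has no strip method) and on blank strings (indexing an empty string raises IndexError). If such values can occur, a guarded variant along these lines may help. The helper name last_char is hypothetical, and returning None for unusable values is an assumption about the desired behavior:

```python
import numpy as np
import pandas as pd

def last_char(value):
    """Return the last non-whitespace character, or None if there is none."""
    # Hypothetical helper: non-strings (e.g. NaN) and blank strings yield None
    if not isinstance(value, str):
        return None
    stripped = value.strip()
    return stripped[-1] if stripped else None

# Made-up sample data including an empty string and a missing value
df = pd.DataFrame({'UserId': ['1234 ', '', np.nan, '907']})
df['LastDigit'] = [last_char(x) for x in df['UserId']]
```

The extra checks cost some speed, so this only makes sense when the column is not guaranteed to be clean.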

Your solution is really slow; it is the last option from this list:

6) updating an empty frame (e.g. using loc one-row-at-a-time)

Performance:

import numpy as np
import pandas as pd

np.random.seed(456)
users = ['joe','jan ','ben','rick ','clare','mary','tom']
df = pd.DataFrame({
    'UserId': np.random.choice(users, size=1000),
})

In [139]: %%timeit
     ...: df['LastDigit'] = np.nan
     ...: for i in range(0,len(df['UserId'])):
     ...:     df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
     ...: 
__main__:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
57.9 s ± 1.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [140]: %timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
1.38 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [141]: %timeit df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
343 µs ± 8.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
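
The SettingWithCopyWarning in the loop timing above comes from the chained indexing df.loc[i]['LastDigit'] = ..., which may write to a temporary copy of the row rather than to df itself. If a row-by-row loop is really needed, a sketch like the following uses a single .at call per cell instead. The sample frame is made up, and the loop is typically still much slower than the vectorized forms:

```python
import pandas as pd

# Made-up sample data with trailing whitespace in some values
df = pd.DataFrame({'UserId': ['joe', 'jan ', 'rick ']})

# Start with an object column so that strings can be stored in it
df['LastDigit'] = None

for i in df.index:
    # One indexing operation on df itself, so the value lands in the frame
    df.at[i, 'LastDigit'] = df.at[i, 'UserId'].strip()[-1]

print(df)
```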
jezrael answered Oct 17 '22 08:10


Another option is to use apply. It is not as performant as the list comprehension, but it is very flexible depending on your goals. Here are some timings on a random DataFrame with shape (44289, 31):

%timeit df['LastDigit'] = df['UserId'].apply(lambda x: str(x)[-1])  # str(x) covers values that are not strings
12.4 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
31.5 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['LastDigit'] = [str(x).strip()[-1] for x in df['UserId']]
9.7 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
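
Timings like these depend on data size, hardware, and pandas version, so it is worth measuring on your own data. A sketch using the standard timeit module, which also works outside IPython. The sample frame and the labels in candidates are made up for illustration:

```python
import timeit

import numpy as np
import pandas as pd

# Made-up sample data, similar in spirit to the frames used in this thread
rng = np.random.default_rng(456)
users = ['joe', 'jan ', 'ben', 'rick ', 'clare', 'mary', 'tom']
df = pd.DataFrame({'UserId': rng.choice(users, size=10_000)})

# Each candidate computes the last non-whitespace character of every UserId
candidates = {
    'str accessor': lambda: df['UserId'].str.strip().str[-1],
    'apply': lambda: df['UserId'].apply(lambda x: str(x).strip()[-1]),
    'list comprehension': lambda: [str(x).strip()[-1] for x in df['UserId']],
}

for name, func in candidates.items():
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f'{name}: {best * 1e3:.2f} ms per call')
```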
el_Rinaldo answered Oct 17 '22 07:10