I have a Dataframe called df_base that looks like this. As you can see, there's a column called Sex that's male or female. I want to map these values to 0 and 1, respectively.
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
There's a few methods that I've seen dotted about on StackOverflow but I'm wondering what the most efficient is to perform the following mapping:
+---------+---------+
| Old Sex | New Sex |
+---------+---------+
| male | 0 |
| female | 1 |
| female | 1 |
| female | 1 |
| male | 0 |
+---------+---------+
I'm using this:
df_base['Sex'].replace(['male','female'],[0,1],inplace=True)
... but I can't help but feel as though this is a little shoddy. Is there a better way of doing this? There's also using .loc but that loops around the rows of the Dataframe, so is less efficient, right?
I think here is better/faster use map by dictionary if only male and female exist in column Sex:
df_base['Sex'] = df_base['Sex'].map(dict(zip(['male','female'],[0,1]))
What is same like:
df_base['Sex'] = df_base['Sex'].map({'male': 0,'female': 1})
Solution if exist only female and male values is cast boolean mask to integers True/False to 1,0:
df_base['Sex'] = (df_base['Sex'] == 'female').astype(int)
Performance:
np.random.seed(2019)
import perfplot
def ma(df):
df = df.copy()
df['Sex_new'] = df['Sex'].map({'male': 0,'female': 1})
return df
def rep1(df):
df = df.copy()
df['Sex'] = df['Sex'].replace(['male','female'],[0,1])
return df
def nwhere(df):
df = df.copy()
df['Sex_new'] = np.where(df['Sex'] == 'male', 0, 1)
return df
def mask1(df):
df = df.copy()
df['Sex_new'] = (df['Sex'] == 'female').astype(int)
return df
def mask2(df):
df = df.copy()
df['Sex_new'] = (df['Sex'].values == 'female').astype(int)
return df
def make_df(n):
df = pd.DataFrame({'Sex': np.random.choice(['male','female'], size=n)})
return df
perfplot.show(
setup=make_df,
kernels=[ma, rep1, nwhere, mask1, mask2],
n_range=[2**k for k in range(2, 18)],
logx=True,
logy=True,
equality_check=False, # rows may appear in different order
xlabel='len(df)')

Conclusion:
If replace only 2 values is slowiest replace, numpy.where, map and mask are similar. For improve performance compare by numpy array with .values.
Also all depends of data, so best test with real data.
My instinct would have suggested to use .map(), but I made a comparison between your solution and map, based on a dataframe with 1500 random male/female values.
%timeit df_base['Sex_new'] = df_base['Sex'].map({'male': 0,'female': 1})
1000 loops, best of 3: 653 µs per loop
Edited Based on coldspeeds comment, and because reassigning it is a better comparison with the others:
%timeit df_base['Sex_new'] = df_base['Sex'].replace(['male','female'],[0,1])
1000 loops, best of 3: 968 µs per loop
So actually slower .map()...!
So based on this example, your 'shoddy' solution seems faster than .map()...
Edit
pygo's solution:
%timeit df_base['Sex_new'] = np.where(df_base['Sex'] == 'male', 0, 1)
1000 loops, best of 3: 331 µs per loop
So faster!
Jezrael's solution with .astype(int):
%timeit df_base['Sex_new'] = (df_base['Sex'] == 'female').astype(int)
1000 loops, best of 3: 388 µs per loop
So also faster than .map() and .replace().
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With