
Replace values in pandas column with default value for missing keys

I have multiple simple functions that need to be applied to every row of certain columns of my dataframe. The dataframe is very large, 10 million+ rows. It looks something like this:

Date      location   city        number  value
12/3/2018   NY       New York      2      500
12/1/2018   MN       Minneapolis   3      600
12/2/2018   NY       Rochester     1      800
12/3/2018   WA       Seattle       2      400

I have functions like these:

def normalized_location(row):
    if row['city'] == "Minneapolis":
        return "FCM"
    elif row['city'] == "Seattle":
        return "FCS"
    else:
        return "Other"

and then I use:

df['Normalized Location'] = df.apply(lambda row: normalized_location(row), axis=1)

This is extremely slow. How can I make it more efficient?

Nazanin Zinouri asked Dec 03 '18 21:12




1 Answer

We can make this BLAZING fast by replacing the row-wise apply with Series.map and a defaultdict.

from collections import defaultdict

d = defaultdict(lambda: 'Other')
d.update({"Minneapolis": "FCM", "Seattle": "FCS"})

df['normalized_location'] = df['city'].map(d)

print(df)
        Date location         city  number  value normalized_location
0  12/3/2018       NY     New York       2    500               Other
1  12/1/2018       MN  Minneapolis       3    600                 FCM
2  12/2/2018       NY    Rochester       1    800               Other
3  12/3/2018       WA      Seattle       2    400                 FCS

The defaultdict supplies 'Other' for any city that is missing from the mapping, which circumvents a separate fillna call, for performance reasons. This approach generalises to multiple replacements quite easily.
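
As a rough sketch of both points, the snippet below extends the mapping with one more key and compares the defaultdict version with the two-step map-then-fillna version it replaces. The "Rochester" code "FCR" and the variable names are made up for illustration and are not part of the question.

```python
from collections import defaultdict

import pandas as pd

# City column copied from the sample data in the question.
df = pd.DataFrame({"city": ["New York", "Minneapolis", "Rochester", "Seattle"]})

# Hypothetical extended mapping: the "Rochester" -> "FCR" entry is invented here.
codes = {"Minneapolis": "FCM", "Seattle": "FCS", "Rochester": "FCR"}

# Approach from the answer: map uses the defaultdict's default for missing keys.
lookup = defaultdict(lambda: "Other", codes)
via_defaultdict = df["city"].map(lookup)

# Two-step alternative: a plain dict gives NaN for missing keys,
# and fillna then replaces those NaN values with the default.
via_fillna = df["city"].map(codes).fillna("Other")

print(via_defaultdict.tolist())  # ['Other', 'FCM', 'FCR', 'FCS']
print(via_fillna.tolist())       # ['Other', 'FCM', 'FCR', 'FCS']
```

Both versions should give the same column on this data. Adding another replacement only means adding another key to the dictionary.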

cs95 answered Sep 21 '22 00:09
