Imputation of missing values for categories in pandas

Tags: pandas

How can I fill NaNs with the most frequent level in a categorical column of a pandas DataFrame?

The R randomForest package provides the na.roughfix option, which returns: "A completed data matrix or data frame. For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered."

In pandas, for numeric variables I can fill NaN values with:

df = df.fillna(df.median())

asked Sep 16 '15 20:09

Igor Barinov

2 Answers

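You can use df = df.fillna(df['Label'].value_counts().index[0]) to fill NaNs with the most frequent value from one column, here 'Label'. Note that this fills the NaNs in every column of df with that single value; to fill only the 'Label' column, assign df['Label'] = df['Label'].fillna(df['Label'].value_counts().index[0]) instead.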

If you want to fill every column with its own most frequent value, you can use:

df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))
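One caveat with the value_counts approach: a column that contains only NaNs has no counts, so index[0] raises an IndexError. A minimal sketch of a guarded version follows; the helper name fill_with_most_frequent and the sample columns are made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_most_frequent(col):
    """Fill NaNs in one column with its most frequent non-null value."""
    counts = col.value_counts()  # NaNs are excluded by default
    if counts.empty:
        # Nothing to impute from: leave an all-NaN column unchanged.
        return col
    return col.fillna(counts.index[0])


# Hypothetical sample data: a categorical column, a string column,
# and a column that is entirely missing.
df = pd.DataFrame({
    "Label": pd.Categorical(["a", "b", "a", None]),
    "shirt": ["S", None, "S", "M"],
    "empty": [np.nan, np.nan, np.nan, np.nan],
})

filled = df.apply(fill_with_most_frequent)
print(filled)
```

The all-NaN column is returned as-is rather than raising, which may or may not be what you want for your data.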

UPDATE 2018-25-10 ⬇

Starting from version 0.13.1, pandas includes a mode method for Series and DataFrames. You can use it to fill the missing values in each column with that column's own most frequent value, like this:

df = df.fillna(df.mode().iloc[0])
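To get closer to what na.roughfix does (medians for numeric columns, most frequent value for the rest), the two fills can be chained. This is only a sketch under some assumptions: the function name rough_fix and the sample data are invented here, numeric_only=True is passed because recent pandas versions raise on df.median() when non-numeric columns are present, and ties are not broken at random as in R (as far as I know, mode() returns the modes sorted, so iloc[0] picks the first of them).

```python
import numpy as np
import pandas as pd


def rough_fix(df):
    """Rough imputation: medians for numeric columns, modes for the others."""
    # Step 1: numeric columns get their column median.
    filled = df.fillna(df.median(numeric_only=True))
    # Step 2: whatever is still missing gets the column's most frequent value.
    return filled.fillna(filled.mode().iloc[0])


# Hypothetical sample data with one numeric and one string column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "color": ["red", "blue", None, "red"],
})

result = rough_fix(df)
print(result)
```

Here the missing age becomes the median (30.0) and the missing color becomes "red".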

answered Sep 25 '22 04:09

hellpanderr

def fillna(col):
    # Fill NaNs in this column with its most frequent value.
    return col.fillna(col.value_counts().index[0])

df = df.apply(fillna)

answered Sep 23 '22 04:09

Pratik Gohil
