
Pandas - filling NaNs in Categorical data

Tags: python, pandas

I am trying to fill missing values (NaN) using the code below:

NAN_SUBSTITUTION_VALUE = 1
g = g.fillna(NAN_SUBSTITUTION_VALUE)

but I am getting the following error:

ValueError: fill value must be in categories.

Could anybody please shed some light on this error?

339 likes, asked Sep 22 '15 13:09 by deega



5 Answers

Your question omits an important detail: what g is, and in particular that it has a categorical dtype. I assume it is something like this:

g = pd.Series(["A", "B", "C", np.nan], dtype="category")

The problem you are experiencing is that fillna requires a value that already exists as a category. For instance, g.fillna("A") would work, but g.fillna("D") fails. To fill the series with a new value you can do:

g_without_nan = g.cat.add_categories("D").fillna("D")
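
As a self-contained illustration of the point above, here is a minimal sketch. It assumes the example Series from this answer; the names `rejected` and `filled` are only illustrative. The exception type raised for an unknown fill value differs between pandas versions (older releases raise ValueError, newer ones TypeError), so the sketch catches both.

```python
import numpy as np
import pandas as pd

# Assumed example data: a categorical Series with one missing value.
g = pd.Series(["A", "B", "C", np.nan], dtype="category")

# Filling with a value that is not yet a category is rejected.
# The exception type depends on the pandas version, so catch both.
try:
    g.fillna("D")
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Register the new category first, then fill.
filled = g.cat.add_categories("D").fillna("D")

print(rejected)
print(filled.tolist())
print(list(filled.cat.categories))
```

Note that add_categories returns a new Series here, so g itself still has only the original three categories.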
96 likes, answered Oct 19 '22 23:10 by bluenote10


Add the category before you fill:

g = g.cat.add_categories([1])
g = g.fillna(1)  # fillna returns a new Series, so assign the result
42 likes, answered Oct 19 '22 22:10 by G. Cheng


Once you create categorical data, you can only insert values that are already among its categories.

>>> df
    ID  value
0    0     20
1    1     43
2    2     45

>>> df["cat"] = df["value"].astype("category")
>>> df
    ID  value    cat
0    0     20     20
1    1     43     43
2    2     45     45

>>> df.loc[1, "cat"] = np.nan
>>> df
    ID  value    cat
0    0     20     20
1    1     43    NaN
2    2     45     45

>>> df.fillna(1)
ValueError: fill value must be in categories
>>> df.fillna(43)
    ID  value    cat
0    0     20     20
1    1     43     43
2    2     45     45
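
If the fill value is not among the categories, one possible approach for a DataFrame is to register it on the categorical column and then fill only that column. The following is a rough sketch that rebuilds the data from the session above; `fill_value` is a hypothetical name, and the membership check is just one way to avoid adding a category twice.

```python
import numpy as np
import pandas as pd

# Assumed data, mirroring the session above.
df = pd.DataFrame({"ID": [0, 1, 2], "value": [20, 43, 45]})
df["cat"] = df["value"].astype("category")
df.loc[1, "cat"] = np.nan

fill_value = 1  # hypothetical fill value that is not yet a category

# Register the fill value on the categorical column if needed,
# then fill that column only.
if fill_value not in df["cat"].cat.categories:
    df["cat"] = df["cat"].cat.add_categories([fill_value])
df["cat"] = df["cat"].fillna(fill_value)

print(df)
```

Filling column by column also avoids writing the fill value into NaNs of unrelated columns.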
7 likes, answered Oct 19 '22 22:10 by pacholik



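As others have said, this error occurs because the column's dtype is 'category'.
I suggest converting it to string first, calling fillna, and finally converting it back to category if needed. Note that the fill value must then be a string, and the resulting categories are strings as well.

g = g.astype('string')
g = g.fillna(str(NAN_SUBSTITUTION_VALUE))  # the string dtype only accepts string fill values
g = g.astype('category')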
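
To make the round trip concrete, here is a small sketch on assumed example data. The fill value is passed through str() on the assumption that the string dtype rejects non-string fill values; as a consequence, the new category is the string "1", not the integer 1.

```python
import numpy as np
import pandas as pd

NAN_SUBSTITUTION_VALUE = 1

# Assumed example data: string categories with one missing value.
g = pd.Series(["A", "B", np.nan], dtype="category")

g = g.astype("string")                     # leave the categorical dtype
g = g.fillna(str(NAN_SUBSTITUTION_VALUE))  # fill with the string "1"
g = g.astype("category")                   # categories are rebuilt from the data

print(g.tolist())
print(list(g.cat.categories))
```

Because the categories are rebuilt from the values present, any category that was declared but unused before the conversion is lost.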
3 likes, answered Oct 19 '22 22:10 by Yves


Sometimes you may want to replace NaNs with values that are already present in your dataset. In that case you can use:

# collect the non-missing values of the column to sample replacements from
values = df[field].dropna().to_numpy()

# draw one random existing value for each missing entry
missing = df[field].isna()
df.loc[missing, field] = np.random.choice(values, size=missing.sum())

It works quite efficiently.
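
The same idea can be wrapped in a small function that leaves the input untouched. This is only a sketch: `fill_na_by_sampling` is a hypothetical helper, the example data is made up, and the seeded generator is there purely to make the run reproducible. Since every drawn value already occurs in the column, it is a valid category and the categorical dtype is preserved.

```python
import numpy as np
import pandas as pd


def fill_na_by_sampling(series, rng):
    """Return a copy of `series` with NaNs replaced by randomly drawn observed values."""
    missing = series.isna()
    observed = series.dropna().to_numpy()
    # Nothing to fill, or nothing to sample from: return an unchanged copy.
    if observed.size == 0 or not missing.any():
        return series.copy()
    filled = series.copy()
    # Draw one observed value per missing entry.
    filled.loc[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled


rng = np.random.default_rng(0)  # seeded only for reproducibility
s = pd.Series(pd.Categorical(["red", None, "blue", "red", None]))
out = fill_na_by_sampling(s, rng)
print(out.tolist())
```

Sampling from the observed values roughly preserves their relative frequencies, which a constant fill value would not.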

1 like, answered Oct 19 '22 23:10 by Victor Zuanazzi