I am trying to fill missing values (NaN) using the code below:
NAN_SUBSTITUTION_VALUE = 1
g = g.fillna(NAN_SUBSTITUTION_VALUE)
but I am getting the following error:
ValueError: fill value must be in categories.
Could anybody please shed some light on this error?
Method 1: Filling with the most common class. One approach to imputing a categorical feature is to replace its missing values with the most frequently occurring class. You can find that class by taking the first index entry of the result of pandas' value_counts() method, which sorts the classes by descending frequency.
You can use df = df.fillna(df['Label'].value_counts().index[0]) to fill NaNs with the most frequent value from one column.
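As a minimal sketch of the most-common-class approach on a categorical column: the sample data below is made up for illustration, and only the column name 'Label' comes from the snippet above. Because the most frequent value is by definition an existing category, this fill does not trigger the "fill value must be in categories" error.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name "Label" comes from the text above.
df = pd.DataFrame({"Label": pd.Series(["a", "b", "a", np.nan], dtype="category")})

# value_counts() ignores NaN by default and sorts by descending frequency,
# so the first index entry is the most common class.
most_common = df["Label"].value_counts().index[0]

# The most common class is already a category, so fillna accepts it.
df["Label"] = df["Label"].fillna(most_common)
```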
Your question is missing an important point: what g is, and especially that it has a categorical dtype. I assume it is something like this:
g = pd.Series(["A", "B", "C", np.nan], dtype="category")
The problem you are experiencing is that fillna requires a value that already exists as a category. For instance, g.fillna("A") would work, but g.fillna("D") fails. To fill the series with a new value you can do:
g_without_nan = g.cat.add_categories("D").fillna("D")
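The behaviour described above can be checked with a short sketch. The `failed` flag is introduced here only for illustration. The exception type appears to differ between pandas versions (ValueError in older releases, TypeError in newer ones), so both are caught.

```python
import numpy as np
import pandas as pd

g = pd.Series(["A", "B", "C", np.nan], dtype="category")

# Filling with a value that is not yet a category is rejected.
try:
    g.fillna("D")
    failed = False
except (ValueError, TypeError):
    # Older pandas raises ValueError, newer pandas raises TypeError.
    failed = True

# Registering the new category first makes the same fill succeed.
g_without_nan = g.cat.add_categories("D").fillna("D")
```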
Add the category before you fill:
g = g.cat.add_categories([1])
g = g.fillna(1)
Once you create categorical data, you can only insert values that are already among its categories.
>>> df
   ID  value
0   0     20
1   1     43
2   2     45
>>> df["cat"] = df["value"].astype("category")
>>> df
   ID  value  cat
0   0     20   20
1   1     43   43
2   2     45   45
>>> df.loc[1, "cat"] = np.nan
>>> df
   ID  value  cat
0   0     20   20
1   1     43  NaN
2   2     45   45
>>> df.fillna(1)
ValueError: fill value must be in categories
>>> df.fillna(43)
   ID  value  cat
0   0     20   20
1   1     43   43
2   2     45   45
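If you do want df.fillna(1) to work on a frame like this one, one possible sketch is to register the fill value as a category on every categorical column first. The loop and the `fill_value` and `filled` names are illustrative assumptions, not part of the answer above. The membership check is there because add_categories raises if the category already exists.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the session above.
df = pd.DataFrame({"ID": [0, 1, 2], "value": [20, 43, 45]})
df["cat"] = df["value"].astype("category")
df.loc[1, "cat"] = np.nan

fill_value = 1  # hypothetical fill value, not among the existing categories

# Add the fill value to each categorical column that does not have it yet.
for col in df.select_dtypes(include="category").columns:
    if fill_value not in df[col].cat.categories:
        df[col] = df[col].cat.add_categories([fill_value])

filled = df.fillna(fill_value)
```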
As many have said before, this error comes from the fact that the feature's dtype is 'category'.
I suggest converting it to string first, then using fillna, and finally converting it back to category if needed.
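# Convert to the string dtype; missing values stay missing
g = g.astype('string')
# A string series expects a string fill value, so convert it explicitly
g = g.fillna(str(NAN_SUBSTITUTION_VALUE))
# Convert back to category if needed
g = g.astype('category')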
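A small end-to-end sketch of this round trip, using made-up sample data. The fill value is passed as a string here on the assumption that a string-typed series may reject a non-string fill value, so the new category ends up being "1" rather than the integer 1.

```python
import numpy as np
import pandas as pd

NAN_SUBSTITUTION_VALUE = 1

# Hypothetical categorical series with one missing value.
g = pd.Series(["A", "B", np.nan], dtype="category")

# category -> string -> fill -> category
filled = (
    g.astype("string")
     .fillna(str(NAN_SUBSTITUTION_VALUE))
     .astype("category")
)
```

Keep in mind that the round trip rebuilds the categories from the values present, so any ordering or unused categories defined on the original series are not carried over.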
Sometimes you may want to replace the NaNs with values that are already present in your dataset. In that case you can use this:
# create a random permutation of the column's values (missing ones included)
permutation = np.random.permutation(df[field])
# drop empty strings and NaN so that neither can be drawn as a replacement
keep = ~(pd.isnull(permutation) | (permutation == ""))
permutation = permutation[keep]
# replace each missing value of df[field] with a randomly drawn observed value
end = len(permutation)
df[field] = df[field].apply(lambda x: permutation[np.random.randint(end)] if pd.isnull(x) else x)
It works quite efficiently.
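For larger frames, the same idea can be sketched without a per-row lambda. This variant is an assumption on my part, not the code from the answer above: it draws all replacements at once with NumPy's Generator and lets fillna align them by index. The column name and data are made up, and since the draws are random the filled values will vary with the seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical categorical column with two missing values.
df = pd.DataFrame(
    {"color": pd.Series(["red", "blue", np.nan, "red", np.nan], dtype="category")}
)
field = "color"

# Observed (non-missing) values act as the sampling pool.
observed = df[field].dropna().to_numpy()
missing = df[field].isna()

# Draw one replacement per missing row and index the draws by those rows.
draws = pd.Series(
    rng.choice(observed, size=int(missing.sum())),
    index=df.index[missing],
)

# Every draw is an existing category, so fillna accepts them.
df[field] = df[field].fillna(draws)
```

Sampling from the observed values keeps the original class proportions roughly intact, which a single constant fill value does not.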