I have the following question: I have a pandas dataframe, in which missing values are marked by the string na. I want to run an Imputer on it to replace the missing values with the mean in the column. According to the sklearn documentation, the parameter missing_values should help me with this:
missing_values : integer or “NaN”, optional (default=”NaN”) The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.
As I understand it, this means that if I write
df = pd.read_csv(filename)
imp = Imputer(missing_values='na')
imp.fit_transform(df)
the imputer will replace every na value in the dataframe with the mean of its column. Instead, I get an error:
ValueError: could not convert string to float: na
What am I misinterpreting? Is this not how the imputer should work? How can I replace the na strings with the mean, then? Should I just use a lambda for it?
Thank you!
Since you say you want to replace these 'na' strings with the mean of the column, I'm guessing the non-missing values are indeed floats. The problem is that pandas does not recognize the string 'na' as a missing value, and so reads the column with dtype object instead of some flavor of float. The imputer then fails when it tries to convert that column to float, which is the error you are seeing.
Case in point, consider the following .csv file:
test.csv
col1,col2
1.0,1.0
2.0,2.0
3.0,3.0
na,4.0
5.0,5.0
With the naive import df = pd.read_csv('test.csv'), df.dtypes tells us that col1 is of dtype object and col2 is of dtype float64. But how do you take the mean of a bunch of objects?
The solution is to tell pd.read_csv() to interpret the string 'na' as a missing value:
df = pd.read_csv('test.csv', na_values='na')
The resulting dataframe has both columns of dtype float64, and you can now use your imputer.
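To make the difference concrete, here is a minimal sketch that inlines the contents of test.csv through io.StringIO (an assumption made only so the snippet runs without a file on disk) and compares the two ways of reading it. The last step uses plain pandas fillna with the column means as a stand-in for the imputer, so it is an illustration of mean imputation rather than the sklearn call from the question.

```python
import io

import pandas as pd

# Same content as test.csv, inlined so the sketch is self-contained.
csv_text = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Naive read: 'na' is kept as a string, so col1 is not parsed as float.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.dtypes)

# Declaring 'na' as a missing-value marker yields float64 columns with NaN.
df = pd.read_csv(io.StringIO(csv_text), na_values="na")
print(df.dtypes)

# Pandas-only mean imputation: fill each NaN with the mean of its column.
filled = df.fillna(df.mean())
print(filled)
```

If your scikit-learn version no longer ships Imputer, SimpleImputer(strategy='mean') from sklearn.impute should play the same role on the float64 dataframe.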