I am working with the following dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
which says it has some missing values marked with "?". I want to apply the SimpleImputer class, and my code is the following:
file="breast_cancer"
df=pd.read_csv(file,names=['id', 'clump_thickness','unif_cell_size',
'unif_cell_shape', 'marg_adhesion', 'single_epith_cell_size',
'bare_nuclei', 'bland_chromatin', 'normal_nucleoli','mitoses','class'])
df.replace('?',np.NaN,inplace=True)
imp=SimpleImputer(missing_values="NaN")
idf=pd.DataFrame(imp.fit_transform(df))
idf.columns=df.columns
idf.index=df.index
So I want to replace all the ? values in all the columns with the mean and return a new dataframe. The problem is that I get the following error:
Input contains NaN, infinity or a value too large for dtype('float64').
What am I missing?
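You are telling the imputer to look for "NaN", a str, whereas you've replaced ? with np.NaN, a float. The imputer therefore never matches the actual missing entries, and they are still in the data when the input is validated. Instantiate SimpleImputer with np.nan and it works fine: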
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
df.replace('?',np.nan,inplace=True)
imp=SimpleImputer(missing_values=np.nan)
idf=pd.DataFrame(imp.fit_transform(df))
idf.columns=df.columns
idf.index=df.index
idf['bare_nuclei'].isna().sum()
Output:
0
# No nan : Imputing successful
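For reference, here is a minimal self-contained sketch of the same fix. It assumes a small made-up sample in place of the real file (only three of the columns, with invented values) and uses the na_values argument of pd.read_csv to turn "?" into np.nan at load time, which is an alternative to calling df.replace afterwards rather than what the original code does:

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample rows standing in for the real data file;
# "?" marks a missing entry, as in the original dataset.
raw = io.StringIO(
    "1000025,5,1\n"
    "1002945,?,10\n"
    "1015425,3,?\n"
    "1016277,6,4\n"
)
cols = ["id", "clump_thickness", "bare_nuclei"]

# na_values="?" converts the markers to np.nan while parsing,
# so the affected columns load as floats instead of strings.
df = pd.read_csv(raw, names=cols, na_values="?")

# missing_values must be the float np.nan, not the string "NaN".
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
idf = pd.DataFrame(imp.fit_transform(df), columns=df.columns, index=df.index)

# Count the NaN values left after imputation.
print(idf.isna().sum().sum())
```

Note that every column, including id, is passed through the imputer here and comes back as float; select the feature columns first if that is not wanted.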