I have a dataframe, df, in which some columns are of type float64 while the others are of type object. Because of the mixed types, I cannot use
df.fillna('unknown') #getting error "ValueError: could not convert string to float:"
because the error is raised for the columns whose type is float64 (what a misleading error message!).
So I wish I could do something like
for col in df.columns[<dtype == object>]:
    df[col] = df[col].fillna("unknown")
So my question is: is there a filter expression like this that I can use with df.columns?
I guess alternatively, less elegantly, I could do:
for col in df.columns:
    if df[col].dtype == np.dtype('O'):  # object dtype (assumes `import numpy as np`)
        df[col] = df[col].fillna('')
# still puzzled, only empty string works as replacement, 'unknown' would not work for certain value leading to error of "ValueError: Error parsing datetime string "unknown" at position 0"
I would also like to know why, in the code above, replacing '' with 'unknown' works for some cells but fails for others with the error "ValueError: Error parsing datetime string "unknown" at position 0".
Thanks a lot!
Yu
To check the data types in a pandas DataFrame, use the dtypes attribute. It returns a Series whose index holds the DataFrame's column names and whose values are the corresponding data types.
Use DataFrame.dtypes to get the data type of every column in a DataFrame. For a single column (a Series), the dtype attribute returns that column's data type.
Selecting columns based on their name: the most basic way to select a single column from a DataFrame is to put the column's name, as a string, in brackets. This returns a pandas Series. Passing a list in the brackets lets you select multiple columns at the same time.
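As a small illustration of the bracket selection described above, here is a minimal sketch. The frame and its column names (name, score, city) are made up for the example and do not come from the question.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5], "city": ["x", "y"]})

single = df["name"]             # one column name -> pandas Series
subset = df[["name", "score"]]  # list of names -> DataFrame with those columns

print(type(single).__name__)
print(list(subset.columns))
```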
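Using select_dtypes is more concise:
import numpy as np
# select the float columns
# (np.float was removed in NumPy 1.24; np.float64 is the equivalent)
df_float = df.select_dtypes(include=[np.float64])
# select the non-numeric columns
df_non_num = df.select_dtypes(exclude=[np.number])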
You can see what the dtype is for all the columns using the dtypes attribute:
In [11]: df = pd.DataFrame([[1, 'a', 2.]])
In [12]: df
Out[12]:
   0  1  2
0  1  a  2
In [13]: df.dtypes
Out[13]:
0      int64
1     object
2    float64
dtype: object
In [14]: df.dtypes == object
Out[14]:
0    False
1     True
2    False
dtype: bool
To access the object columns:
In [15]: df.loc[:, df.dtypes == object]
Out[15]:
   1
0  a
I think it's most explicit to use (I'm not sure that inplace would work here):
In [16]: df.loc[:, df.dtypes == object] = df.loc[:, df.dtypes == object].fillna('')
That said, I recommend you use NaN for missing data.
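One way to see why keeping NaN can be preferable is that pandas recognises it as missing, whereas a placeholder string is just another value. The sketch below uses a made-up column to show the difference; it is an illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical object column with one missing value.
s_nan = pd.Series(["a", np.nan, "b"], dtype=object)
s_filled = s_nan.fillna("unknown")

missing_before = int(s_nan.isna().sum())     # NaN is detected as missing
missing_after = int(s_filled.isna().sum())   # 'unknown' is an ordinary string
count_before = int(s_nan.count())            # count() skips NaN
count_after = int(s_filled.count())          # the placeholder is counted

print(missing_before, missing_after, count_before, count_after)
```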
As @RNA said, you can use pandas.DataFrame.select_dtypes. Applied to the example from your question, the code would look like this:
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].fillna('unknown')
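To try this end to end, here is a self-contained sketch on a small made-up frame. The column names and values are assumptions for illustration, and the text columns are created with dtype=object explicitly so that the object filter matches them regardless of how the installed pandas version infers string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame: one float64 column, two object columns.
df = pd.DataFrame({
    "price": [1.0, np.nan, 3.0],
    "label": pd.Series(["x", None, "z"], dtype=object),
    "note": pd.Series([np.nan, "ok", np.nan], dtype=object),
})

# Fill missing values only in the object columns.
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].fillna("unknown")

print(df)
print(df.dtypes)
```

The float column keeps its NaN and its float64 dtype, so numeric operations on it still work as before.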