I am reading JSON files into DataFrames. The DataFrame might have some string (object) columns, some numeric columns (int64 and/or float64), and some datetime columns. When the data is read in, the dtype is often incorrect (i.e. datetime, int, and float values are often stored as object). I want to report on this possibility (i.e. a column is stored in the DataFrame as object (string), but it is actually a datetime).
The problem I have is that pd.to_numeric and pd.to_datetime will both evaluate and try to convert the column, and many times the result ends up depending on which of the two I call last. (I was going to use convert_objects(), which works, but that is deprecated, so I wanted a better option.)
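For example (a minimal sketch; the sample values are invented), some strings parse cleanly under both converters, which is exactly why the call order matters:

import pandas as pd

# These strings are valid input for both parsers, so the "inferred"
# type depends entirely on which conversion runs last.
s = pd.Series(["20130101", "20130102", "20130103"], dtype="object")

print(pd.to_numeric(s).dtype)   # int64 -- parsed as plain integers
print(pd.to_datetime(s).dtype)  # datetime64[ns] -- parsed as %Y%m%d dates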
The code I am using to evaluate a DataFrame column is below (I realize a lot of it is redundant, but I have written it this way for readability):
try:
    inferred_type = pd.to_datetime(df[Field_Name]).dtype
    if inferred_type == "datetime64[ns]":
        inferred_type = "DateTime"
except Exception:
    pass
try:
    # If this succeeds, it silently overwrites the datetime result above.
    inferred_type = pd.to_numeric(df[Field_Name]).dtype
    if inferred_type == int:
        inferred_type = "Integer"
    if inferred_type == float:
        inferred_type = "Float"
except Exception:
    pass
Pandas will correctly infer data types in many cases, and you can move on with your analysis without any further thought on the topic. Despite how well that works, at some point you will likely need to explicitly convert data from one type to another.
Note that pandas uses different names for data types than Python does, for example object for textual data. A DataFrame column can only have one data type, which can be checked with the Series.dtype attribute. To get the data types of all columns at once, use DataFrame.dtypes, which returns a Series containing the dtype of each column.
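As a quick illustration (the column names and values here are made up), checking dtypes and converting explicitly looks like this:

import pandas as pd

# Hypothetical frame standing in for JSON data that was read as strings.
df = pd.DataFrame({
    "amount": ["1.5", "2.0"],                  # numeric stored as object
    "created": ["2021-01-01", "2021-01-02"],   # datetime stored as object
})

print(df.dtypes)                               # both columns report 'object'
df["amount"] = pd.to_numeric(df["amount"])     # -> float64
df["created"] = pd.to_datetime(df["created"])  # -> datetime64[ns]
print(df.dtypes)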
I came across the same problem of having to figure out column types for incoming data where the type is not known beforehand (from a database read in my case). I couldn't find a good answer here on SO, or by reviewing the Pandas source code. I solved it using this function:
def _get_col_dtype(col):
    """
    Infer the datatype of a pandas column; process only if the column's
    dtype is object.

    input: col: a pandas Series representing a df column.
    """
    if col.dtype == "object":
        # try datetime first, then numeric, then timedelta
        try:
            col_new = pd.to_datetime(col.dropna().unique())
            return col_new.dtype
        except Exception:
            try:
                col_new = pd.to_numeric(col.dropna().unique())
                return col_new.dtype
            except Exception:
                try:
                    col_new = pd.to_timedelta(col.dropna().unique())
                    return col_new.dtype
                except Exception:
                    return "object"
    else:
        return col.dtype
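A sketch of how you might use it to produce the kind of report the question asks for (assuming df is the frame read from JSON):

# Flag columns whose stored dtype differs from the inferred one.
for col in df.columns:
    inferred = _get_col_dtype(df[col])
    if str(inferred) != str(df[col].dtype):
        print(f"{col}: stored as {df[col].dtype}, inferred as {inferred}")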