Python Pandas inferring column datatypes

I am reading JSON files into dataframes. A dataframe may have some string (object) columns, some numeric (int64 and/or float64) columns, and some datetime columns. When the data is read in, the datatype is often incorrect (i.e. datetime, int and float values are often stored as "object"). I want to report on this possibility (i.e. a column is in the dataframe as "object" (string), but it is actually a "datetime").

The problem I have is that pd.to_numeric and pd.to_datetime will both evaluate and try to convert the column, and often the result ends up depending on which of the two I call last. (I was going to use convert_objects(), which works, but it is deprecated, so I wanted a better option.)

The code I am using to evaluate the dataframe column is below (I realize a lot of it is redundant, but I have written it this way for readability):

try:
   inferred_type = pd.to_datetime(df[Field_Name]).dtype
   if inferred_type == "datetime64[ns]":
      inferred_type = "DateTime"
except:
   pass
try:
   inferred_type = pd.to_numeric(df[Field_Name]).dtype
   if inferred_type == int:
      inferred_type = "Integer"
   if inferred_type == float:
      inferred_type = "Float"
except:
   pass
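To make the ambiguity concrete, here is a minimal sketch (not part of the original code) that records which of the two converters accepts a column without raising. The helper name `accepted_by` and the sample Series are made up for illustration.

```python
import pandas as pd


def accepted_by(col):
    """Return the names of the converters that accept every value in col."""
    converters = (("to_numeric", pd.to_numeric), ("to_datetime", pd.to_datetime))
    accepted = []
    for name, convert in converters:
        try:
            convert(col)
            accepted.append(name)
        except Exception:
            # The converter rejected at least one value in the column.
            pass
    return accepted


# Hypothetical sample columns, both stored with dtype "object".
numbers = pd.Series([1, 2, 3], dtype=object)
dates = pd.Series(["2016-01-25", "2016-01-26"], dtype=object)

print(accepted_by(numbers))
print(accepted_by(dates))
```

The numeric column is accepted by both converters, because pd.to_datetime reads plain numbers as offsets from the epoch, while the date strings are accepted only by pd.to_datetime. That is why running both checks unconditionally makes the result depend on which one runs last.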
Calamari asked Jan 25 '16 21:01




1 Answer

I came across the same problem of having to figure out column types for incoming data where the type is not known beforehand (from a database read in my case). I couldn't find a good answer here on SO, or by reviewing the Pandas source code. I solved it using this function:

import pandas as pd


def _get_col_dtype(col):
    """
    Infer datatype of a pandas column, process only if the column dtype is object.
    input:   col: a pandas Series representing a df column.
    """
    if col.dtype != "object":
        return col.dtype

    # Only the distinct non-null values are needed to test a conversion.
    values = col.dropna().unique()

    # Try each conversion in turn: datetime, then numeric, then timedelta.
    for convert in (pd.to_datetime, pd.to_numeric, pd.to_timedelta):
        try:
            return convert(values).dtype
        except Exception:
            continue

    # Nothing converted cleanly, so treat the column as plain object.
    return "object"
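As a possible variation (a sketch, not the function above), the conversions can be tried with pd.to_numeric first, since pd.to_datetime also accepts plain numbers, and the result mapped onto the labels used in the question. The names `infer_label` and `report`, the "String" fallback label, and the sample data are all assumptions made for illustration.

```python
import warnings

import pandas as pd
from pandas.api import types as ptypes


def infer_label(col):
    """Return a label for what an object column appears to hold."""
    if col.dtype != object:
        return str(col.dtype)
    values = col.dropna()
    if values.empty:
        return "String"
    # Numeric first: to_numeric rejects date strings, so it is the stricter test.
    try:
        numeric = pd.to_numeric(values)
        if ptypes.is_integer_dtype(numeric):
            return "Integer"
        if ptypes.is_float_dtype(numeric):
            return "Float"
    except Exception:
        pass
    try:
        with warnings.catch_warnings():
            # Silence format-inference warnings raised for non-date text.
            warnings.simplefilter("ignore")
            pd.to_datetime(values)
        return "DateTime"
    except Exception:
        return "String"


def report(df):
    """Map each column name to its inferred label."""
    return {name: infer_label(df[name]) for name in df.columns}


# Hypothetical data in which every column arrives as dtype "object".
df = pd.DataFrame(
    {
        "count": ["1", "2", "3"],
        "price": ["1.5", "2.25", None],
        "when": ["2016-01-25", "2016-01-26", "2016-01-27"],
        "name": ["alpha", "beta", "gamma"],
    },
    dtype=object,
)
print(report(df))
```

The trade-off is that strings such as "20160125" are reported as integers rather than dates, so this ordering only suits data where dates are not written as bare digits.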
PabTorre answered Sep 20 '22 06:09
