Determining Pandas Column DataType

Sometimes when data is imported into a Pandas DataFrame, every column imports as type object. This is fine for most operations, but I am trying to create a custom export function, and my question is this:

  • Is there a way to force Pandas to infer the data types of the input data?
  • If not, is there a way after the data is loaded to infer the data types somehow?

I know I can tell Pandas that a column is of type int, str, etc., but I don't want to do that; I was hoping Pandas could be smart enough to infer all the data types when a user imports or adds a column.

EDIT - example of import

import pandas as pd

a = ['a']
col = ['somename']
df = pd.DataFrame(a, columns=col)
print(df.dtypes)
>>> somename    object
dtype: object

Shouldn't the type be string?
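As a side note, newer pandas versions (1.0+) offer `convert_dtypes()`, which can upgrade object columns holding text to the dedicated `string` dtype. A minimal sketch, assuming pandas >= 1.0:

```python
import pandas as pd

df = pd.DataFrame(['a'], columns=['somename'])
print(df.dtypes['somename'])                    # object

converted = df.convert_dtypes()
print(converted.dtypes['somename'])             # string
```

This does not change how data is imported, but it gives each column the most specific dtype pandas can infer after the fact.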

asked Dec 21 '16 by code base 5000

People also ask

How do you find the data type of a column in Python?

Use `DataFrame.dtypes` to get the data types of the columns in a DataFrame. In Python's pandas module, the DataFrame class provides this attribute to get the data type information of each column; it returns a Series containing the data type of each column.
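A short sketch of what `DataFrame.dtypes` returns (column names and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ints': [1, 2], 'text': ['x', 'y']})
print(df.dtypes)
# ints     int64
# text    object
# dtype: object
```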

What is the datatype of a column in any DataFrame?

A column in a DataFrame can only have one data type. The data type of a single DataFrame column can be checked using `dtype`. Make conscious decisions about how to manage missing data. A DataFrame can be saved to a CSV file using the `to_csv` function.
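Checking a single column's dtype can be sketched like this (column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'price': [1.5, 2.0]})
print(df['price'].dtype)   # float64
```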

What data type is a pandas DataFrame?

Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels.

How do I change the datatype of a column in pandas?

The best way to convert one or more columns of a DataFrame to numeric values is to use `pandas.to_numeric()`. This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.
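A minimal sketch of `pd.to_numeric()` on a Series of strings (the example data is illustrative):

```python
import pandas as pd

s = pd.Series(['1', '2.5', '3'])
converted = pd.to_numeric(s)
print(converted.dtype)   # float64, since one value has a decimal part
```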


2 Answers

This is only a partial answer, but you can get frequency counts of the data type of the elements in a variable over the entire DataFrame as follows:

dtypeCount = [df.iloc[:, i].apply(type).value_counts() for i in range(df.shape[1])]

This returns

dtypeCount

[<class 'numpy.int32'>    4
 Name: a, dtype: int64,
 <class 'int'>    2
 <class 'str'>    2
 Name: b, dtype: int64,
 <class 'numpy.int32'>    4
 Name: c, dtype: int64]

It doesn't print this nicely, but you can pull out information for any variable by location:

dtypeCount[1]

<class 'int'>    2
<class 'str'>    2
Name: b, dtype: int64

which should get you started in finding what data types are causing the issue and how many of them there are.

You can then inspect the rows that have a str object in the second variable using

df[df.iloc[:,1].map(lambda x: type(x) == str)]

   a  b  c
1  1  n  4
3  3  g  6

data

import pandas as pd

df = pd.DataFrame({'a': range(4),
                   'b': [6, 'n', 7, 'g'],
                   'c': range(3, 7)})
answered Sep 22 '22 by lmo


You can also infer the dtypes after dropping irrelevant rows by using infer_objects(). Below is a general example.

import pandas as pd

df_orig = pd.DataFrame({"A": ["a", 1, 2, 3], "B": ["b", 1.2, 1.8, 1.8]})
df = df_orig.iloc[1:].infer_objects()
print(df_orig.dtypes, df.dtypes, sep='\n\n')

Output:

A    object
B    object
dtype: object

A      int64
B    float64
dtype: object

answered Sep 21 '22 by shahar_m