Sometimes when data is imported into a pandas DataFrame, everything comes in as type object. This is fine for most operations, but I am trying to create a custom export function, and my question is this:
I know I can tell pandas that a column is of type int, str, etc., but I don't want to do that; I was hoping pandas could be smart enough to infer all the data types when a user imports data or adds a column.
EDIT - example of import
import pandas as pd

a = ['a']
col = ['somename']
df = pd.DataFrame(a, columns=col)
print(df.dtypes)
somename    object
dtype: object
The type should be string?
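One note on the example: object is pandas' catch-all dtype for columns backed by Python objects, and str values land there by design. Since pandas 1.0, DataFrame.convert_dtypes() re-infers each column and uses the dedicated nullable string dtype where it fits; a minimal sketch:

import pandas as pd

df = pd.DataFrame(['a'], columns=['somename'])

print(df.dtypes)                   # somename: object (str is stored as object)
print(df.convert_dtypes().dtypes)  # somename: string (pandas' nullable string dtype)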
This is only a partial answer, but you can get frequency counts of the data type of the elements in a variable over the entire DataFrame as follows:
dtypeCount = [df.iloc[:, i].apply(type).value_counts() for i in range(df.shape[1])]
This returns
dtypeCount
[<class 'numpy.int32'> 4
Name: a, dtype: int64,
<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64,
<class 'numpy.int32'> 4
Name: c, dtype: int64]
It doesn't print this nicely, but you can pull out information for any variable by location:
dtypeCount[1]
<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64
which should get you started in finding what data types are causing the issue and how many of them there are.
You can then inspect the rows that have a str object in the second variable using
df[df.iloc[:, 1].map(lambda x: isinstance(x, str))]
   a  b  c
1  1  n  4
3  3  g  6
data
df = pd.DataFrame({'a': range(4),
                   'b': [6, 'n', 7, 'g'],
                   'c': range(3, 7)})
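The same idea generalizes to a small helper that flags only the columns mixing more than one Python type; a minimal sketch (the helper name mixed_type_columns is my own, not from the answer above):

import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Return per-type element counts for every column that mixes Python types."""
    mixed = {}
    for col in df.columns:
        counts = df[col].apply(type).value_counts()
        if len(counts) > 1:  # more than one element type in this column
            mixed[col] = counts
    return mixed

df = pd.DataFrame({'a': range(4),
                   'b': [6, 'n', 7, 'g'],
                   'c': range(3, 7)})

print(mixed_type_columns(df))  # only 'b' mixes <class 'int'> and <class 'str'>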
You can also let pandas re-infer the dtypes after dropping the irrelevant rows by using infer_objects(). Below is a general example.
import pandas as pd

df_orig = pd.DataFrame({"A": ["a", 1, 2, 3], "B": ["b", 1.2, 1.8, 1.8]})
df = df_orig.iloc[1:].infer_objects()
print(df_orig.dtypes, df.dtypes, sep='\n\n')
Output:
A    object
B    object
dtype: object

A      int64
B    float64
dtype: object
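If only specific columns need converting, pandas.to_numeric() does the same job per column and can coerce non-numeric stragglers to NaN rather than raising; a minimal sketch:

import pandas as pd

df = pd.DataFrame({"A": ["a", 1, 2, 3]})

# errors='coerce' turns entries that cannot be parsed into NaN instead of raising.
print(pd.to_numeric(df["A"], errors="coerce"))
# 0    NaN
# 1    1.0
# 2    2.0
# 3    3.0
# Name: A, dtype: float64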