Sometimes when data is imported into a pandas DataFrame, everything comes in as type object. This is fine for most operations, but I am trying to create a custom export function, and my question is this:
I know I can tell pandas that a column is of type int, str, etc., but I don't want to do that; I was hoping pandas could be smart enough to infer all the data types when a user imports data or adds a column.
EDIT - example of import
import pandas as pd

a = ['a']
col = ['somename']
df = pd.DataFrame(a, columns=col)
print(df.dtypes)
somename    object
dtype: object
The type should be string?
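One note on the example: object is pandas' catch-all dtype for columns backed by Python objects, and str values land there by design. Since pandas 1.0, DataFrame.convert_dtypes() re-infers each column and uses the dedicated nullable string dtype where it fits; a minimal sketch:

import pandas as pd

df = pd.DataFrame(['a'], columns=['somename'])

print(df.dtypes)                   # somename: object (str is stored as object)
print(df.convert_dtypes().dtypes)  # somename: string (pandas' nullable string dtype)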
This is only a partial answer, but you can get frequency counts of the data type of the elements in a variable over the entire DataFrame as follows:
dtypeCount = [df.iloc[:, i].apply(type).value_counts() for i in range(df.shape[1])]
This returns
dtypeCount
[<class 'numpy.int32'> 4
Name: a, dtype: int64,
<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64,
<class 'numpy.int32'> 4
Name: c, dtype: int64]
It doesn't print this nicely, but you can pull out information for any variable by location:
dtypeCount[1]
<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64
which should get you started in finding what data types are causing the issue and how many of them there are.
You can then inspect the rows that have a str object in the second variable using
df[df.iloc[:, 1].map(lambda x: isinstance(x, str))]
   a  b  c
1  1  n  4
3  3  g  6
data
df = pd.DataFrame({'a': range(4),
                   'b': [6, 'n', 7, 'g'],
                   'c': range(3, 7)})
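The same idea generalizes to a small helper that flags only the columns mixing more than one Python type; a minimal sketch (the helper name mixed_type_columns is my own, not from the answer above):

import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Return per-type element counts for every column that mixes Python types."""
    mixed = {}
    for col in df.columns:
        counts = df[col].apply(type).value_counts()
        if len(counts) > 1:  # more than one element type in this column
            mixed[col] = counts
    return mixed

df = pd.DataFrame({'a': range(4),
                   'b': [6, 'n', 7, 'g'],
                   'c': range(3, 7)})

print(mixed_type_columns(df))  # only 'b' mixes <class 'int'> and <class 'str'>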
You can also let pandas re-infer the dtypes after dropping the irrelevant rows by using infer_objects(). Below is a general example.
import pandas as pd

df_orig = pd.DataFrame({"A": ["a", 1, 2, 3], "B": ["b", 1.2, 1.8, 1.8]})
df = df_orig.iloc[1:].infer_objects()
print(df_orig.dtypes, df.dtypes, sep='\n\n')
Output:
A    object
B    object
dtype: object

A      int64
B    float64
dtype: object
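If only specific columns need converting, pandas.to_numeric() does the same job per column and can coerce non-numeric stragglers to NaN rather than raising; a minimal sketch:

import pandas as pd

df = pd.DataFrame({"A": ["a", 1, 2, 3]})

# errors='coerce' turns entries that cannot be parsed into NaN instead of raising.
print(pd.to_numeric(df["A"], errors="coerce"))
# 0    NaN
# 1    1.0
# 2    2.0
# 3    3.0
# Name: A, dtype: float64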