I would like all my DataFrames, regardless of which constructor overload built them or whether they come from .read_csv(), .read_excel(), .read_sql(), or any other method, to use the new nullable Int64 datatype as the default dtype for all integers, rather than int64.
I'm willing to go to literally any level of insanity to do this if there isn't a 'nice' way, including subclassing the DataFrame or Series classes, and reimplementing any number of methods and constructor attributes, etc.
My question is, can this be done? If so, how would I go about it?
You could use a function like this:

    def nan_ints(df, convert_strings=False, subset=None):
        types = ['int64', 'float64']
        if subset is None:
            subset = list(df)
        if convert_strings:
            types.append('object')
        for col in subset:
            try:
                if df[col].dtype in types:
                    # Going through float first handles columns with NaNs.
                    df[col] = df[col].astype(float).astype('Int64')
            except (ValueError, TypeError):
                # Leave columns that can't be represented as integers alone.
                pass
        return df
It iterates through each column and converts it to Int64 if it is an int. If it's a float, it will convert to Int64 only if all of the values in the column (other than the NaNs) can be converted to ints. The convert_strings argument gives you the option to convert string columns to Int64 as well.
    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'a': [1.1, 2, 3, 1],
                        'b': [1, 2, 3, np.nan],
                        'c': ['1', '2', '3', np.nan],
                        'd': [3, 2, 1, np.nan]})
    nan_ints(df1, convert_strings=True, subset=['b', 'c'])
    df1.info()
Will return the following:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4 entries, 0 to 3
    Data columns (total 4 columns):
    a    4 non-null float64
    b    3 non-null Int64
    c    3 non-null Int64
    d    3 non-null float64
    dtypes: Int64(2), float64(2)
    memory usage: 216.0 bytes
If you are going to use this on every DataFrame, you could add the function to a module and import it every time you want to use pandas:

    from my_module import nan_ints
Then just use it with something like:
    nan_ints(pd.read_csv(path))
Note: the nullable integer data type is new in version 0.24.0. Here is the documentation.
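If you are on pandas 1.0 or later, there is also DataFrame.convert_dtypes(), which converts columns to the best available nullable dtypes after the fact, so you can wrap any read call with it instead of a custom function. A minimal sketch (the CSV contents here are made up for illustration):

```python
import io
import pandas as pd

# convert_dtypes() (pandas >= 1.0) picks nullable dtypes column by column,
# so integer columns become Int64 even when they contain missing values.
csv = io.StringIO("a,b\n1,\n2,3\n")
df = pd.read_csv(csv).convert_dtypes()
print(df.dtypes)
```

Here column b is read as float64 because of the missing value, but convert_dtypes() turns both columns into Int64.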
I would put my money on monkey patching. The easiest way would be to monkey patch the DataFrame constructor. That should go something like this:
    import pandas

    # Keep a reference to the original constructor...
    pandas.DataFrame.__old__init__ = pandas.DataFrame.__init__

    # ...then wrap it so dtype defaults to the nullable Int64.
    def new_init(self, data=None, index=None, columns=None,
                 dtype=pandas.Int64Dtype(), copy=False):
        self.__old__init__(data=data, index=index, columns=columns,
                           dtype=dtype, copy=copy)

    pandas.DataFrame.__init__ = new_init
Of course, you run the risk of breaking the world. Good luck!
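For comparison, the effect the patch aims for can be reproduced per call, without touching pandas internals, by passing the nullable dtype to the constructor explicitly. A minimal sketch (the column name is made up):

```python
import pandas as pd

# Passing the nullable dtype explicitly is the per-call equivalent
# of defaulting dtype to pandas.Int64Dtype() in the constructor.
df = pd.DataFrame({'units': [1, 2, 3]}, dtype='Int64')
print(df['units'].dtype)  # Int64
```

This is what the monkey-patched default would do on every construction; the explicit form just keeps the blast radius to one call.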