I would like all my DataFrames, regardless of which constructor overload built them or whether they come from .read_csv(), .read_excel(), .read_sql(), or any other method, to use the new nullable Int64 datatype as the default dtype for all integers, rather than int64.
I'm willing to go to literally any level of insanity to do this if there isn't a 'nice' way, including subclassing the DataFrame or Series classes, and reimplementing any number of methods and constructor attributes, etc.
My question is, can this be done? If so, how would I go about it?
You could use a function like this:

    def nan_ints(df, convert_strings=False, subset=None):
        types = ['int64', 'float64']
        if subset is None:
            subset = list(df)
        if convert_strings:
            types.append('object')
        for col in subset:
            try:
                if df[col].dtype in types:
                    # Going through float first handles columns with NaNs.
                    df[col] = df[col].astype(float).astype('Int64')
            except (ValueError, TypeError):
                # Leave columns that can't be represented as integers alone.
                pass
        return df
It iterates through each column and converts it to Int64 if it is an int. If it's a float, it will convert to Int64 only if all of the values in the column (other than the NaNs) can be converted to ints. The convert_strings argument gives you the option to convert string columns to Int64 as well.
    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'a': [1.1, 2, 3, 1],
                        'b': [1, 2, 3, np.nan],
                        'c': ['1', '2', '3', np.nan],
                        'd': [3, 2, 1, np.nan]})
    nan_ints(df1, convert_strings=True, subset=['b', 'c'])
    df1.info()
Will return the following:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4 entries, 0 to 3
    Data columns (total 4 columns):
    a    4 non-null float64
    b    3 non-null Int64
    c    3 non-null Int64
    d    3 non-null float64
    dtypes: Int64(2), float64(2)
    memory usage: 216.0 bytes
If you are going to use this on every DataFrame, you could add the function to a module and import it every time you want to use pandas:

    from my_module import nan_ints
Then just use it with something like:
    nan_ints(pd.read_csv(path))
Note: the nullable integer data type is new in version 0.24.0. Here is the documentation.
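If you are on pandas 1.0 or later, there is also DataFrame.convert_dtypes(), which converts columns to the best available nullable dtypes after the fact, so you can wrap any read call with it instead of a custom function. A minimal sketch (the CSV contents here are made up for illustration):

```python
import io
import pandas as pd

# convert_dtypes() (pandas >= 1.0) picks nullable dtypes column by column,
# so integer columns become Int64 even when they contain missing values.
csv = io.StringIO("a,b\n1,\n2,3\n")
df = pd.read_csv(csv).convert_dtypes()
print(df.dtypes)
```

Here column b is read as float64 because of the missing value, but convert_dtypes() turns both columns into Int64.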
I would put my money on monkey patching. The easiest way would be to monkey patch the DataFrame constructor. That should go something like this:
    import pandas

    # Keep a reference to the original constructor...
    pandas.DataFrame.__old__init__ = pandas.DataFrame.__init__

    # ...then wrap it so dtype defaults to the nullable Int64.
    def new_init(self, data=None, index=None, columns=None,
                 dtype=pandas.Int64Dtype(), copy=False):
        self.__old__init__(data=data, index=index, columns=columns,
                           dtype=dtype, copy=copy)

    pandas.DataFrame.__init__ = new_init
Of course, you run the risk of breaking the world. Good luck!
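For comparison, the effect the patch aims for can be reproduced per call, without touching pandas internals, by passing the nullable dtype to the constructor explicitly. A minimal sketch (the column name is made up):

```python
import pandas as pd

# Passing the nullable dtype explicitly is the per-call equivalent
# of defaulting dtype to pandas.Int64Dtype() in the constructor.
df = pd.DataFrame({'units': [1, 2, 3]}, dtype='Int64')
print(df['units'].dtype)  # Int64
```

This is what the monkey-patched default would do on every construction; the explicit form just keeps the blast radius to one call.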