I am new here; ideally I would have commented this on the question where I learned this usage of idxmax. I used the same approach, and below is my code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=["A", "B", "C", "D"], index=[0, 1, 2, 3])

As soon as I use df[df > 6] on this df, the int values change to float:
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN 7.0
2 8.0 9.0 10.0 11.0
3 12.0 13.0 14.0 15.0
Why does pandas do that? I also read somewhere that I could use dtype=object on a Series, but are there other ways to avoid this?
The limitation is mostly with NumPy: an ndarray can only hold a single dtype, and NumPy's integer dtypes have no way to represent a missing value. So we end up with a dilemma when we do df[df > 6]. Pandas is going to return a dataframe whose values equal df where df > 6 and are null otherwise. But there is no integer null value, so pandas has a choice to make: either use None or np.nan as the null and make the entire ndarray dtype==object, or use np.nan as the null and make the entire array dtype==float. Pandas chooses to make the arrays float, because keeping the values numeric preserves many of the advantages that come with numeric dtypes and their calculations.
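The upcast is easy to observe directly. A minimal sketch, rebuilding the question's df and inspecting dtypes before and after the mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=["A", "B", "C", "D"])

# np.nan is itself a Python float, so any column forced to hold it is upcast.
print(type(np.nan))                # <class 'float'>
print(df.dtypes.unique())          # an integer dtype before masking
print(df[df > 6].dtypes.unique())  # float64 after masking
```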
Option 1
Use a fill value and pd.DataFrame.where
df.where(df > 6, -1)
A B C D
0 -1 -1 -1 -1
1 -1 -1 -1 7
2 8 9 10 11
3 12 13 14 15
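A quick check (a sketch reusing the question's df) confirms why this works: the fill value -1 is an integer, so no NaN is ever introduced and the integer dtype survives:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=["A", "B", "C", "D"])

# where() keeps values satisfying the condition and substitutes -1 elsewhere;
# since -1 is an int, the columns never need to hold NaN.
filled = df.where(df > 6, -1)
print(filled.dtypes.unique())  # still an integer dtype
```

The trade-off is that -1 must be a value you can safely treat as "missing" downstream.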
Option 2
pd.DataFrame.stack and loc
By converting to a single dimension, we aren't forced to fill missing values in the rectangular grid with nulls.
df.stack().loc[lambda x: x > 6]
1 D 7
2 A 8
B 9
C 10
D 11
3 A 12
B 13
C 14
D 15
dtype: int64
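As a sketch of the same idea end to end: stack() collapses the frame into a Series keyed by a (row, column) MultiIndex, the boolean filter simply drops entries, and the integer dtype is preserved because no rectangular grid ever needs padding:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=["A", "B", "C", "D"])

# Filtering a 1-D Series just removes entries; nothing has to become NaN.
s = df.stack().loc[lambda x: x > 6]
print(s.dtype)  # still an integer dtype
```

Note that calling s.unstack() to get back a 2-D frame reintroduces the missing cells, and with them the float upcast.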
If you do want the values to keep displaying as ints:
df.astype(object).mask(df<=6)
Out[114]:
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN 7
2 8 9 10 11
3 12 13 14 15
You can find more information here and here.
This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”. One possibility is to use dtype=object arrays instead.
More information about astype(object):
df.astype(object).mask(df<=6).applymap(type)
Out[115]:
A B C D
0 <class 'float'> <class 'float'> <class 'float'> <class 'float'>
1 <class 'float'> <class 'float'> <class 'float'> <class 'int'>
2 <class 'int'> <class 'int'> <class 'int'> <class 'int'>
3 <class 'int'> <class 'int'> <class 'int'> <class 'int'>
In previous versions (< 0.24.0) pandas indeed converted any int columns to floats if even a single NaN was present. But not anymore: optional nullable integer support was officially added in pandas 0.24.0. Quoting the pandas 0.24.x release notes: "Pandas has gained the ability to hold integer dtypes with missing values."
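A minimal sketch of the nullable dtype (the capitalized "Int64" extension dtype, available since pandas 0.24): casting first lets the mask keep integers, with missing cells shown as <NA> instead of NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=["A", "B", "C", "D"])

# "Int64" (capital I) is pandas' nullable integer extension dtype;
# masked-out cells become pd.NA and the columns stay integer.
masked = df.astype("Int64").where(df > 6)
print(masked.dtypes.unique())  # Int64
print(masked)
```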