
How to stop Pandas DataFrame from converting int to float for no reason?

I am creating a small Pandas DataFrame and adding some data to it which is supposed to be integers. But even though I am trying very hard to explicitly set the dtype to int and only provide int values, it always ends up becoming floats. It is making no sense to me at all and the behaviour doesn't even seem entirely consistent.

Consider the following Python script:

import pandas as pd

df = pd.DataFrame(columns=["col1", "col2"])  # No dtype specified.
print(df.dtypes)  # dtypes are object, since there is no information yet.
df.loc["row1", :] = int(0)  # Add integer data.
print(df.dtypes)  # Both columns have now become int64, as expected.
df.loc["row2", :] = int(0)  # Add more integer data.
print(df.dtypes)  # Both columns are now float64???
print(df)  # Shows as 0.0.

# Let's try again, but be more specific.
del df
df = pd.DataFrame(columns=["col1", "col2"], dtype=int)  # Explicitly set dtype.
print(df.dtypes)  # For some reason both columns are already float64???
df.loc["row1", :] = int(0)
print(df.dtypes)  # Both columns are still float64.

# Output:
"""
col1    object
col2    object
dtype: object
col1    int64
col2    int64
dtype: object
col1    float64
col2    float64
dtype: object
      col1  col2
row1   0.0   0.0
row2   0.0   0.0
col1    float64
col2    float64
dtype: object
col1    float64
col2    float64
dtype: object
"""

I can fix it by doing df = df.astype(int) at the end. There are other ways to fix it as well. But this should not be necessary. I am trying to figure out what I am doing wrong that makes the columns become floats in the first place.

What is going on?

Python version 3.7.1 Pandas version 0.23.4

EDIT:

I think maybe some people are misunderstanding. There are never any NaN values in this DataFrame. Immediately after its creation it looks like this:

Empty DataFrame
Columns: [col1, col2]
Index: []

It is an empty DataFrame (df.shape[0] is 0), but there is no NaN in it; there are just no rows yet.

I have also discovered something even worse. Even if I do df = df.astype(int) after adding data such that it becomes int, it becomes float again as soon as I add more data!

df = pd.DataFrame(columns=["col1", "col2"], dtype=int)
df.loc["row1", :] = int(0)
df.loc["row2", :] = int(0)
df = df.astype(int)  # Force it back to int.
print(df.dtypes)  # It is now ints again.
df.loc["row3", :] = int(0)  # Add another integer row.
print(df.dtypes)  # It is now float again???

# Output:
"""
col1    int32
col2    int32
dtype: object
col1    float64
col2    float64
dtype: object
"""

The suggested fix in version 0.24 does not seem related to my problem. That feature is about Nullable Integer Data Type. There are no NaN or None values in my data.

PaulMag asked Apr 01 '19 12:04



1 Answer

df.loc["rowX"] = int(0) works and solves the problem posed in the question. df.loc["rowX",:] = int(0) does not: it converts the columns to float. That is a surprise.

df.loc["rowX"] = int(0) can populate an empty DataFrame while preserving the desired dtype, assigning an entire row at a time.

df.loc["rowX"] = [np.int64(0), np.int64(1)] works.
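Since the result seems to depend on the pandas version, a defensive variant may be worth having. The following is only a sketch: `add_int_row` is a hypothetical helper (not part of pandas) that uses the whole-row form and casts back to int64 if the columns were upcast anyway. It assumes every value written is a whole number.

```python
import numpy as np
import pandas as pd


def add_int_row(df, label, values):
    """Assign a whole row by label, then restore int64 if pandas upcast it.

    Hypothetical helper for illustration; not a pandas API.
    """
    df.loc[label] = values  # whole-row form, no column slice
    if not all(pd.api.types.is_integer_dtype(t) for t in df.dtypes):
        # Assumption: all stored values are whole numbers, so the cast is lossless.
        df = df.astype(np.int64)
    return df


df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
df = add_int_row(df, "row1", [np.int64(0), np.int64(1)])
df = add_int_row(df, "row2", [np.int64(2), np.int64(3)])
print(df.dtypes)
```

The check after each assignment costs little, and the cast only runs when the dtype has actually drifted.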

.loc[] is appropriate for label-based assignment per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html. Note: the 0.24 doc does not depict .loc[] for inserting new rows.

The doc shows .loc[] adding rows by assignment in a column-sensitive way, but only where the DataFrame is already populated with data.

But it gets weird when slicing on the empty frame.

import pandas as pd
import numpy as np
import sys

print(sys.version)
print(pd.__version__)

print("int dtypes preserved")
# append on populated DataFrame
df = pd.DataFrame([[0, 0], [1,1]], index=['a', 'b'], columns=["col1", "col2"])
df.loc["c"] = np.int64(0)
# slice existing rows
df.loc["a":"c"] = np.int64(1)
df.loc["a":"c", "col1":"col2":1] = np.int64(2)
print(df.dtypes)

# no selection AND no data, remains np.int64 if defined as such
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
df.loc[:, "col1":"col2":1] = np.int64(0)
df.loc[:,:] = np.int64(0)
print(df.dtypes)

# and works if no index but data
df = pd.DataFrame([[0, 0], [1,1]], columns=["col1", "col2"])
df.loc[:,"col1":"col2":1] = np.int64(0)
print(df.dtypes)

# the surprise... label based insertion for the entire row does not convert to float
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
df.loc["a"] = np.int64(0)
print(df.dtypes)

# a surprise because referring to all columns, as above, does convert to float
print("unexpectedly converted to float dtypes")
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
df.loc["a", "col1":"col2"] = np.int64(0)
print(df.dtypes)

3.7.2 (default, Mar 19 2019, 10:33:22) 
[Clang 10.0.0 (clang-1000.11.45.5)]
0.24.2
int dtypes preserved
col1    int64
col2    int64
dtype: object
col1    int64
col2    int64
dtype: object
col1    int64
col2    int64
dtype: object
col1    int64
col2    int64
dtype: object
unexpectedly converted to float dtypes
col1    float64
col2    float64
dtype: object
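A different route, not taken in the code above, is to avoid growing the frame row by row at all: collect the rows first and construct the DataFrame once. This is a minimal sketch with made-up labels and values, on the assumption that the surprising upcast is tied to adding rows to an existing frame.

```python
import pandas as pd

# Collect rows first; these labels and values are illustrative only.
rows = {"row1": [0, 0], "row2": [0, 0], "row3": [0, 0]}

# Build the frame in one step, so no row is ever added to an existing frame.
df = pd.DataFrame.from_dict(rows, orient="index", columns=["col1", "col2"])
print(df.dtypes)
```

Because the constructor sees only integers, it can infer an integer dtype directly, and no later astype(int) is needed.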
Rich Andrews answered Sep 20 '22 11:09