 

Why does pd.concat change the resulting datatype from int to float?

I have three dataframes: timestamp (with timestamps), dataSun (with timestamps of sunrise and sunset), and dataData (with various climate data). The timestamp dataframe has datatype "int64".

timestamp.head()
       timestamp
0  1521681600000
1  1521681900000
2  1521682200000
3  1521682500000
4  1521682800000

The dataframe dataSun also has datatype "int64".

 dataSun.head()
         sunrise         sunset
0  1521696105000  1521740761000
1  1521696105000  1521740761000
2  1521696105000  1521740761000
3  1521696105000  1521740761000
4  1521696105000  1521740761000

The dataframe with climate data, dataData, has datatype "float64".

dataData.head()
   temperature     pressure  humidity
0     2.490000  1018.000000      99.0
1     2.408333  1017.833333      99.0
2     2.326667  1017.666667      99.0
3     2.245000  1017.500000      99.0
4     2.163333  1017.333333      99.0
5     2.081667  1017.166667      99.0

I want to concatenate these three dataframes into one.

dataResult = pd.concat((timestamp, dataSun, dataData), axis=1)
dataResult.head()
       timestamp       sunrise        sunset  temperature     pressure     
0  1521681600000  1.521696e+12  1.521741e+12     2.490000  1018.000000   
1  1521681900000  1.521696e+12  1.521741e+12     2.408333  1017.833333   
2  1521682200000  1.521696e+12  1.521741e+12     2.326667  1017.666667   
3  1521682500000  1.521696e+12  1.521741e+12     2.245000  1017.500000   
4  1521682800000  1.521696e+12  1.521741e+12     2.163333  1017.333333   
5  1521683100000  1.521696e+12  1.521741e+12     2.081667  1017.166667   

dataResult.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7188 entries, 0 to 7187
Data columns (total 6 columns):
timestamp      7188 non-null int64
sunrise        7176 non-null float64
sunset         7176 non-null float64
temperature    7176 non-null float64
pressure       7176 non-null float64
humidity       7176 non-null float64
dtypes: float64(5), int64(1)

Why does pd.concat change the datatype of the dataSun values? I have tried different ways to concatenate the dataframes. For example, I concatenated only timestamp and dataSun into one dataframe first, then concatenated the result with dataData, but the result was the same. How can I concatenate the three dataframes and preserve the datatypes?

Asked by M.Nelly on Apr 20 '18

2 Answers

Because of this -

timestamp      7188 non-null int64
sunrise        7176 non-null float64
...

timestamp has 7188 non-null values, while sunrise and onwards have only 7176. That means 12 values in those columns are null, i.e., NaN. They appear because pd.concat with axis=1 performs an outer join on the indexes by default, so the 12 index values present only in timestamp get NaN in every other column.

Since NaN has dtype float, every other value in those columns is automatically upcast to float, and floats that large are typically displayed in scientific notation.
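You can see this upcasting in isolation with a two-line example (a minimal sketch, unrelated to the question's data):

    import pandas as pd

    s = pd.Series([1521696105000, 1521740761000], dtype='int64')
    print(s.dtype)  # int64
    # reindexing onto a longer index introduces a missing row,
    # which forces the whole series to float64
    print(s.reindex([0, 1, 2]).dtype)  # float64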

That's the why, but it doesn't solve your problem. Your options at this point are:

  1. drop the rows with NaNs using dropna
  2. fill those NaNs with some default integer value using fillna

(You can then downcast those columns back to int; see the sketch at the end of this answer.)

  3. Alternatively, perform pd.concat with join='inner'; no NaNs are introduced and the dtypes are preserved.

    pd.concat((timestamp, dataSun, dataData), axis=1, join='inner')
    
           timestamp        sunrise         sunset  temperature     pressure  \    
    0  1521681600000  1521696105000  1521740761000     2.490000  1018.000000   
    1  1521681900000  1521696105000  1521740761000     2.408333  1017.833333   
    2  1521682200000  1521696105000  1521740761000     2.326667  1017.666667   
    3  1521682500000  1521696105000  1521740761000     2.245000  1017.500000   
    4  1521682800000  1521696105000  1521740761000     2.163333  1017.333333   
    
       humidity  
    0      99.0  
    1      99.0  
    2      99.0  
    3      99.0  
    4      99.0 
    

With option 3, an inner join is performed on the indexes of each dataframe.
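For completeness, here is a minimal sketch of options 1 and 2, reusing the dataframe names from the question; the 0 passed to fillna is an arbitrary placeholder, not a recommended sentinel:

    import pandas as pd

    dataResult = pd.concat((timestamp, dataSun, dataData), axis=1)

    # Option 1: drop the rows that picked up NaNs, then downcast back to int64
    dropped = dataResult.dropna().astype({'sunrise': 'int64', 'sunset': 'int64'})

    # Option 2: fill the NaNs with a placeholder instead, then downcast
    filled = (dataResult
              .fillna({'sunrise': 0, 'sunset': 0})
              .astype({'sunrise': 'int64', 'sunset': 'int64'}))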

Answered by cs95 on Sep 19 '22


As of pandas 1.0.0, I believe you have another option: first call convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the NaN issues discussed in the answer above.
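A minimal sketch of that approach, again reusing the names from the question (requires pandas >= 1.0):

    import pandas as pd

    # convert_dtypes switches int64 columns to the nullable Int64 dtype
    # (capital "I"), which can hold pd.NA without upcasting to float
    dataResult = pd.concat(
        (timestamp.convert_dtypes(),
         dataSun.convert_dtypes(),
         dataData.convert_dtypes()),
        axis=1,
    )
    print(dataResult.dtypes)  # sunrise and sunset stay Int64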

Answered by totalhack on Sep 20 '22