Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas, astype(int) applied to float column returns negative numbers

My task is to read data from excel to dataframe. The data is a bit messy and to clean that up I've done:

df_1 = pd.read_excel(offers[0])
df_1 = df_1.rename(columns={'Наименование [Дата Файла: 29.05.2019 время: 10:29:42 ]':'good_name', 
                     'Штрихкод':'barcode', 
                     'Цена шт. руб.':'price',
                     'Остаток': 'balance'
                    })
df_1 = df_1[new_columns]
# I don't know why but without replacing NaN with another char code doesn't work
df_1.barcode = df_1.barcode.fillna('_')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to numeric
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
df_1.head()

It returns column barcode with type float64 (why so?)

0    0.000000e+00
1    7.613037e+12
2    7.613037e+12
3    7.613034e+12
4    7.613035e+12
Name: barcode, dtype: float64

Then I try to convert that column to integer.

df_1.barcode = df_1.barcode.astype(int)

But I keep getting silly negative numbers.

df_1.barcode[0:5]
0             0
1   -2147483648
2   -2147483648
3   -2147483648
4   -2147483648

Name: barcode, dtype: int32

Thanks to @Will and @micric eventually I've got a solution.

df_1 = pd.read_excel(offers[0])
df_1 = df_1[new_columns]
# replacing NaN with 0, it'll help to convert the column explicitly to dtype integer
df_1.barcode = df_1.barcode.fillna('0')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to integer
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer')

Resume:

  • pd.to_numeric converts NaN to float64. As a result from column with both NaN and not-Nan values we should expect column dtype float64.
  • Check size of number you're dealing with. int32 has its limit, which is 2**32 = 4294967296. Thanks a lot for your help, guys!
like image 975
Sergey Solod Avatar asked May 31 '19 08:05

Sergey Solod


People also ask

What does Astype float do?

pandas Convert String to Float astype() function to convert column from string/int to float, you can apply this on a specific column or on an entire DataFrame. To cast the data type to 54-bit signed float, you can use numpy. float64 , numpy.

What does Astype do in pandas?

Pandas DataFrame astype() Method The astype() method returns a new DataFrame where the data types has been changed to the specified type.

What is Astype int?

astype() is a method within numpy. ndarray , as well as the Pandas Series class, so can be used to convert vectors, matrices and columns within a DataFrame . However, int() is a pure-Python function that can only be applied to scalar values. For example, you can do int(3.14) , but can't do (2.7).


3 Answers

That number is a 32 bit lower limit. Your number is out of the int32 range you are trying to use, so it returns you the limit (notice that 2**32 = 4294967296, divided by 2 2147483648 that is your number).

You should use astype(int64) instead.

like image 53
micric Avatar answered Oct 11 '22 00:10

micric


I ran into the same problem as OP, using

astype(np.int64)

solved mine, see the link here.

I like this solution because it's consistent with my habit of changing the column type of pandas column, maybe someone could check the performance of these solutions.

like image 30
Jason Goal Avatar answered Oct 11 '22 02:10

Jason Goal


Many questions in one.

So your expected dtype...

pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)

pd.to_numeric downcast to integer would give you an integer, however, you have NaNs in your data and pandas needs to use a float64 type to represent NaNs

like image 4
Will Avatar answered Oct 11 '22 02:10

Will