pandas iterrows changes ints into floats

I'm trying to iterate over the rows of a DataFrame that contains some int64s and some floats. iterrows() seems to be turning my ints into floats, which breaks everything I want to do downstream:

>>> import pandas as pd
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [id for id in df.id]
[10000000000000001, 10000000000000002]
>>> [r['id'] for (idx,r) in df.iterrows()]
[10000000000000000.0, 10000000000000002.0]

Iterating directly over df.id is fine. But through iterrows(), I get different values. Is there a way to iterate over the rows in such a way that I can still index by column name and get all the correct values?

Barry asked Jan 12 '16 17:01



2 Answers

Here's the relevant part of the docs:

Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames) [...] To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.

Example for your data:

>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [t[1] for t in df.itertuples()]
[10000000000000001, 10000000000000002]
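
Since the question asks for access by column name, here is a minimal sketch of the same idea using the namedtuple fields instead of positions. It assumes the column names are valid Python identifiers (as 'id' and 'prc' are); the variable names ids, prices, col and ids_by_name are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[10000000000000001, 1.5], [10000000000000002, 2.5]],
    columns=["id", "prc"],
)

# Each row is a namedtuple, so fields can be read by column name.
# index=False leaves the index label out of the tuple.
ids = [row.id for row in df.itertuples(index=False)]
prices = [row.prc for row in df.itertuples(index=False)]

# getattr covers the case where the column name is held in a variable.
col = "id"
ids_by_name = [getattr(row, col) for row in df.itertuples(index=False)]

print(ids)
print(prices)
```

The ints come back unchanged because each column keeps its own dtype; no row-wide Series is ever built.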
timgeb answered Oct 13 '22 22:10


If possible you're better off avoiding iteration. Check if you can vectorize your work first.
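
As a rough illustration of what vectorizing can look like, the sketch below works on whole columns instead of looping. The derived columns id_next and prc_cents are hypothetical examples and not something the question asked for.

```python
import pandas as pd

df = pd.DataFrame(
    [[10000000000000001, 1.5], [10000000000000002, 2.5]],
    columns=["id", "prc"],
)

# Column-wise arithmetic keeps each column's dtype: "id" stays int64,
# so the large integers are never routed through a float.
df["id_next"] = df["id"] + 1

# Hypothetical derived column: price in whole cents, stored as int64.
df["prc_cents"] = (df["prc"] * 100).round().astype("int64")

print(df.dtypes)
print(df)
```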

If vectorization is impossible, you probably want DataFrame.itertuples. That will return an iterable of (named)tuples where the first element is the index label.

In [2]: list(df.itertuples())
Out[2]:
[Pandas(Index=0, id=10000000000000001, prc=1.5),
 Pandas(Index=1, id=10000000000000002, prc=2.5)]

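iterrows returns a Series for each row. Since a Series is backed by a single NumPy array, whose elements must all share one dtype, your ints were cast to floats. A float64 cannot represent every integer above 2**53, which is why 10000000000000001 came back as 10000000000000000.0.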
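
The upcast can be checked directly. This is a small sketch against the DataFrame from the question; the last part shows one possible workaround (casting to object before iterating), which is my assumption of something that fits here rather than anything the docs recommend, and object columns are slow.

```python
import pandas as pd

df = pd.DataFrame(
    [[10000000000000001, 1.5], [10000000000000002, 2.5]],
    columns=["id", "prc"],
)

# The columns have different dtypes...
column_dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}

# ...but a row Series has to pick one dtype for both values.
_, first_row = next(df.iterrows())
row_dtype = str(first_row.dtype)

# float64 has a 53-bit significand, so this odd integer above 2**53
# is rounded to the nearest representable value.
rounded = float(10000000000000001)

# Possible workaround: object columns hold Python ints and floats as-is,
# so the row Series is object dtype and nothing is converted.
ids_via_object = [r["id"] for _, r in df.astype(object).iterrows()]

print(column_dtypes, row_dtype, rounded, ids_via_object)
```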

TomAugspurger answered Oct 13 '22 22:10