I'm trying to iterate over the rows of a DataFrame that contains some int64s and some floats. iterrows()
seems to be turning my ints into floats, which breaks everything I want to do downstream:
>>> import pandas as pd
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [id for id in df.id]
[10000000000000001, 10000000000000002]
>>> [r['id'] for (idx,r) in df.iterrows()]
[10000000000000000.0, 10000000000000002.0]
Iterating directly over df.id is fine. But through iterrows(), I get different values. Is there a way to iterate over the rows in such a way that I can still index by column name and get all the correct values?
The iterrows() method returns an iterator over the rows of the DataFrame. Each iteration produces the row's index label and the row itself as a pandas Series.
Use the pandas DataFrame.astype() function to convert a column from string/int to float; you can apply it to a specific column or to an entire DataFrame. To cast to a 64-bit float, you can use numpy.float64.
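A small sketch of what such a cast does to the values from the question, assuming the same two-column frame (the variable names id_as_float and round_tripped are illustrative only). It suggests why a float cast is risky here: integers this large may not survive the trip through float64.

```python
import pandas as pd

df = pd.DataFrame(
    [[10000000000000001, 1.5], [10000000000000002, 2.5]],
    columns=["id", "prc"],
)

# Cast a single column to 64-bit float; the original frame is left unchanged.
id_as_float = df["id"].astype("float64")

# float64 has a 53-bit significand, so integers above 2**53 may not round-trip.
round_tripped = id_as_float.astype("int64")

print(df["id"].dtype)
print(id_as_float.dtype)
print(round_tripped.tolist())
```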
itertuples iterates over the rows of the DataFrame as named tuples. With index=False the index is left out, so the first column's value becomes the first element of each tuple. It is generally faster than iterrows.
By using apply with axis=1, we can run a function on every row of a DataFrame. This still loops over the rows internally, but apply is typically faster than iterrows.
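A rough sketch of the apply approach, assuming the question's frame and a hypothetical per-row function double_price. With axis=1 each row is still handed to the function as a Series, so the same upcasting concern as with iterrows may apply; the sketch prints the row dtype so you can check on your own data.

```python
import pandas as pd

df = pd.DataFrame(
    [[10000000000000001, 1.5], [10000000000000002, 2.5]],
    columns=["id", "prc"],
)

def double_price(row):
    # Hypothetical per-row function; `row` is a Series indexed by column name.
    return row["prc"] * 2

# axis=1 means "call the function once per row".
doubled = df.apply(double_price, axis=1)
print(doubled.tolist())

# Inspect the dtype each row Series arrives with.
row_dtypes = df.apply(lambda row: str(row.dtype), axis=1)
print(row_dtypes.tolist())
```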
Here's the relevant part of the docs:
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames) [...] To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.
Example for your data:
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [t[1] for t in df.itertuples()]
[10000000000000001, 10000000000000002]
If possible you're better off avoiding iteration. Check if you can vectorize your work first.
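As an illustration of what "vectorize" can mean here, a minimal sketch with made-up operations (adding 1 to the ids, summing and filtering on prc); the actual downstream work is not stated in the question, so treat these as placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    [[10000000000000001, 1.5], [10000000000000002, 2.5]],
    columns=["id", "prc"],
)

# Column-wise operations run on each column in its own dtype,
# so the int64 ids are never routed through float64.
next_id = df["id"] + 1
total = df["prc"].sum()

# Boolean masks select whole rows without an explicit Python loop.
expensive = df[df["prc"] > 2.0]

print(next_id.tolist())
print(total)
print(expensive["id"].tolist())
```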
If vectorization is impossible, you probably want DataFrame.itertuples. That will return an iterable of (named)tuples where the first element is the index label.
In [2]: list(df.itertuples())
Out[2]:
[Pandas(Index=0, id=10000000000000001, prc=1.5),
Pandas(Index=1, id=10000000000000002, prc=2.5)]
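Since the question asks for access by column name, here is a short sketch of how the named tuples can be used that way, assuming the same frame. Attribute access works directly; _asdict() is the standard namedtuple method for getting a mapping when string keys are needed.

```python
import pandas as pd

df = pd.DataFrame(
    [[10000000000000001, 1.5], [10000000000000002, 2.5]],
    columns=["id", "prc"],
)

# Attribute access by column name; the ids keep their exact integer values.
ids = [t.id for t in df.itertuples()]

# getattr works when the column name is only known at runtime.
col = "prc"
prices = [getattr(t, col) for t in df.itertuples()]

# For dictionary-style lookup, convert each namedtuple with _asdict().
ids_via_dict = [t._asdict()["id"] for t in df.itertuples(index=False)]

print(ids)
print(prices)
print(ids_via_dict)
```

Note that itertuples renames columns to positional names if they are not valid Python identifiers, are repeated, or start with an underscore, so attribute access is only convenient for well-behaved column names.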
iterrows returns a Series for each row. Since Series are backed by NumPy arrays, whose elements must all share a single type, your ints were cast to floats.
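To see the cast happen, a minimal sketch on the question's frame: the columns have separate dtypes, but a single row pulled out as a Series has only one, and the first id is too large to be represented exactly as a 64-bit float.

```python
import pandas as pd

df = pd.DataFrame(
    [[10000000000000001, 1.5], [10000000000000002, 2.5]],
    columns=["id", "prc"],
)

# Each column keeps its own dtype inside the DataFrame.
column_dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}
print(column_dtypes)

# A row extracted as a Series must settle on one dtype for both values.
first_row = df.iloc[0]
print(first_row.dtype)

# 10000000000000001 is above 2**53, so float64 cannot hold it exactly.
print(10000000000000001 > 2**53)
print(float(10000000000000001) == 10000000000000000.0)
```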