
How to preserve the datatype while iterating dataframe in pandas?

If I print out a dataframe directly, I get the correct output with correct datatypes. However, when I try to iterate the same dataframe, the datatypes are changing.

Here is my program:

import pandas as pd

F = 9.37556366342
p = 0.000101673198518
df_between = 2
df_within = 471
df_total = 473

summary_stats_vals = [(F,p,df_between,df_within,df_total)]
labels = ['F-statistics', 'p-value', 'df-between', 'df-within', 'df-total']
df = pd.DataFrame.from_records(summary_stats_vals,columns=labels)


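# Iterating the dataframe
for index, row in df.iterrows():
    df_row = list()
    for col in df.columns:
        # Assumed loop body (missing from the post): show each value and its type
        print(row[col], type(row[col]))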

In the printed output (originally shown in a screenshot), the data types of df_between, df_within and df_total are not preserved during iteration: they change from int to float. How can I preserve the data types while iterating over a dataframe?

user3288051 asked Feb 10 '18 15:02


2 Answers

From the docs:

Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).
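To make the quoted behaviour concrete, here is a minimal sketch that rebuilds the frame from the question and inspects one row produced by iterrows(). The names `column_dtypes`, `index` and `row` are only illustrative; everything else is standard pandas.

```python
import pandas as pd

# Same single-row frame as in the question.
labels = ['F-statistics', 'p-value', 'df-between', 'df-within', 'df-total']
df = pd.DataFrame.from_records(
    [(9.37556366342, 0.000101673198518, 2, 471, 473)], columns=labels
)

# Inside the frame, each column keeps its own dtype.
column_dtypes = df.dtypes.astype(str).to_dict()
print(column_dtypes)

# iterrows() packs each row into a single Series, which has one dtype.
# With float and int columns mixed, that common dtype is a float.
index, row = next(df.iterrows())
print(row.dtype)
print(type(row['df-between']))
```

The integer columns are still integers in `df`; only the per-row Series built by iterrows() is upcast.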

You could use DataFrame.itertuples() and get namedtuples for each row.

>>> for r in df.itertuples(index=False):
...     print(r)

Pandas(_0=9.3755636634199995, _1=0.000101673198518, _2=2, _3=471, _4=473)
>>> for r in df.itertuples(index=False):
...     print(r._3)


Changing your column names to valid Python identifiers might make more sense:

labels = ['F_statistics', 'p_value', 'df_between', 'df_within', 'df_total']

>>> for r in df.itertuples(index=False, name='Stuff'):
...     print(r)

Stuff(F_statistics=9.3755636634199995, p_value=0.000101673198518, df_between=2, df_within=471, df_total=473)
>>> for r in df.itertuples(index=False, name='Stuff'):
...     print(r.df_total)
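
If you would rather not retype the labels, one possible way is to rename the columns programmatically before calling itertuples(). This is only a sketch: `renamed` and `first_row` are illustrative names, and replacing hyphens with underscores is an assumption that happens to fit these particular labels.

```python
import pandas as pd

labels = ['F-statistics', 'p-value', 'df-between', 'df-within', 'df-total']
df = pd.DataFrame.from_records(
    [(9.37556366342, 0.000101673198518, 2, 471, 473)], columns=labels
)

# Replace hyphens so that every label is a valid Python identifier.
renamed = df.rename(columns=lambda name: name.replace('-', '_'))

first_row = None
for r in renamed.itertuples(index=False, name='Stuff'):
    # Each field keeps the type of its own column.
    print(r.df_total, type(r.df_total))
    print(r.F_statistics, type(r.F_statistics))
    first_row = r
```

The integer fields come back as integers and the float fields as floats, because each namedtuple field is taken from its own column rather than from a shared Series.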


I haven't found an explicit statement in the docs that a Series has a single, homogeneous dtype, but it can be inferred: a Series acts like a NumPy ndarray, and the constructor has a dtype parameter which applies to all the values in the Series. The docs describe a Series as:

One-dimensional ndarray with axis labels (including time series).

It looks like even if only one value in the Series is a float, the Series dtype will be float:

>>> s = pd.Series([1,2,3,4.1], index=['a','b','c','d'])
>>> s
a    1.0
b    2.0
c    3.0
d    4.1
dtype: float64
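
For comparison, a small sketch of the same Series built with dtype=object, which is one way to stop the upcast. The names `s_default` and `s_object` are illustrative, and object dtype gives up the fast numeric operations, so treat this as a demonstration rather than a recommendation.

```python
import pandas as pd

values = [1, 2, 3, 4.1]
index = ['a', 'b', 'c', 'd']

# Default construction: one float forces the whole Series to float64.
s_default = pd.Series(values, index=index)

# dtype=object stores the original Python objects unchanged.
s_object = pd.Series(values, index=index, dtype=object)

print(s_default.dtype, type(s_default['a']))
print(s_object.dtype, type(s_object['a']), type(s_object['d']))
```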
wwii answered Oct 20 '22 21:10


Thank you so much wwii. Yeah that worked out very well. The code below is what I needed. Thanks again for your help.

for r in df.itertuples(index=False, name='summary_stats'):
    for item in r:
        # Assumed loop body (missing from the post): show each value and its type
        print(item, type(item))

I get this output:

user3288051 answered Oct 20 '22 19:10
