If I print a dataframe directly, I get the correct output with the correct datatypes. However, when I iterate over the same dataframe, the datatypes change.
Here is my program:
import pandas as pd

F = 9.37556366342
p = 0.000101673198518
df_between = 2
df_within = 471
df_total = 473
summary_stats_vals = [(F, p, df_between, df_within, df_total)]
labels = ['F-statistics', 'p-value', 'df-between', 'df-within', 'df-total']
df = pd.DataFrame.from_records(summary_stats_vals, columns=labels)
print(df)
print()

# Iterating the dataframe
for index, row in df.iterrows():
    df_row = list()
    df_row.append(index)
    for col in df.columns:
        df_row.append(row[col])
    print(row)
The data types of df_between, df_within and df_total are not preserved while iterating: they change from int to float. How can I preserve the data types while iterating over a dataframe?
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.
One simple way to iterate over the columns of a pandas DataFrame is a for loop. Looping over the DataFrame yields its column labels, and each label can be used with the get-item syntax ([]) to select that column. The values attribute exposes the underlying elements as a NumPy array.
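As a minimal sketch of the column loop described above, the snippet below rebuilds the frame from the question and records each column's dtype kind. The name `column_kinds` is only an illustration and does not appear in the original post.

```python
import pandas as pd

# Same single-row frame as in the question.
labels = ['F-statistics', 'p-value', 'df-between', 'df-within', 'df-total']
df = pd.DataFrame.from_records(
    [(9.37556366342, 0.000101673198518, 2, 471, 473)], columns=labels
)

# Iterating a DataFrame directly yields its column labels.
column_kinds = {}  # hypothetical helper dict, for illustration only
for col in df:
    # df[col] is a Series holding one column, so it keeps that column's dtype.
    column_kinds[col] = df[col].dtype.kind  # 'f' = float, 'i' = signed integer
    print(col, df[col].dtype)
```

Looping column by column keeps the integer columns as integers, because each column is stored with its own dtype.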
Vectorization is usually the first and best choice. When you do need to loop, you can convert the DataFrame to a NumPy array or to a dictionary to speed up the iteration. Iterating over the key-value pairs of a dictionary was reported as the fastest option, with a speed-up of around 280x on 20 million records.
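A small sketch of both ideas, assuming the frame from the question with underscore column names. The variable names `df_sum` and `records` are illustrative, and no timing claim is made here.

```python
import pandas as pd

labels = ['F_statistics', 'p_value', 'df_between', 'df_within', 'df_total']
df = pd.DataFrame.from_records(
    [(9.37556366342, 0.000101673198518, 2, 471, 473)], columns=labels
)

# Vectorized: operate on whole columns, with no Python-level loop.
df_sum = df['df_between'] + df['df_within']

# Dictionary form: one dict per row, keyed by column label.
records = df.to_dict('records')
for rec in records:
    print(rec['df_between'], rec['df_within'], rec['df_total'])
```

The column-wise sum stays an integer column, since both operands are integer columns.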
The iterrows() method returns an iterator over the rows of a DataFrame. Each iteration produces an index label and a row object (a pandas Series).
From the docs:
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).
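To make the quoted note concrete, here is a minimal sketch using the frame from the question. It compares the per-column dtypes with the single dtype of a row produced by iterrows(). The name `row_dtype` is only for illustration.

```python
import pandas as pd

labels = ['F-statistics', 'p-value', 'df-between', 'df-within', 'df-total']
df = pd.DataFrame.from_records(
    [(9.37556366342, 0.000101673198518, 2, 471, 473)], columns=labels
)

# Each column has its own dtype.
print(df.dtypes)

# A row from iterrows() is a single Series, so it has a single dtype.
index, row = next(df.iterrows())
row_dtype = row.dtype
print(row_dtype)

# The integer columns were upcast to float to fit that one dtype.
print(type(row['df-between']))
```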
You could use DataFrame.itertuples() and get a namedtuple for each row.
>>> for r in df.itertuples(index=False):
...     print(r)
Pandas(_0=9.3755636634199995, _1=0.000101673198518, _2=2, _3=471, _4=473)
>>> for r in df.itertuples(index=False):
...     print(r._3)
471
Changing your column names to valid Python identifiers might make more sense:
...
labels = ['F_statistics', 'p_value', 'df_between', 'df_within', 'df_total']
...
>>> for r in df.itertuples(index=False, name='Stuff'):
...     print(r)
Stuff(F_statistics=9.3755636634199995, p_value=0.000101673198518, df_between=2, df_within=471, df_total=473)
>>>
>>> for r in df.itertuples(index=False, name='Stuff'):
...     print(r.df_total)
473
>>>
I haven't found an explicit statement in the docs that a Series has a single, homogeneous datatype, but it can be inferred: a Series acts like a NumPy ndarray, and the constructor has a dtype parameter which applies to all the values in the Series:
One-dimensional ndarray with axis labels (including time series).
Looks like even if only one value in the Series is a float, the series dtype will be float:
>>> s = pd.Series([1,2,3,4.1], index=['a','b','c','d'])
>>> s
a    1.0
b    2.0
c    3.0
d    4.1
dtype: float64
>>>
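As a hedged illustration of the same point, the sketch below builds that Series twice: once with the inferred dtype and once with dtype=object, which stores the Python objects as given instead of converting them to one numeric type. The names `s_float` and `s_obj` are illustrative. Object dtype gives up the speed of numeric arrays, so this is a demonstration rather than a recommendation.

```python
import pandas as pd

values = [1, 2, 3, 4.1]
index = ['a', 'b', 'c', 'd']

# Inferred dtype: one float forces the whole Series to float64.
s_float = pd.Series(values, index=index)
print(s_float.dtype, type(s_float['a']))

# Explicit object dtype: each element keeps its original Python type.
s_obj = pd.Series(values, index=index, dtype=object)
print(s_obj.dtype, type(s_obj['a']), type(s_obj['d']))
```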
Thank you so much wwii. Yeah that worked out very well. The code below is what I needed. Thanks again for your help.
for r in df.itertuples(index=False, name='summary_stats'):
    for item in r:
        print(item)
I get this output:
9.37556366342
0.000101673198518
2
471
473
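For completeness, here is one possible way to rebuild the list-per-row loop from the question on top of itertuples(). It assumes the underscore column names suggested in the answer. The outer list `rows` is a hypothetical addition for collecting the results.

```python
import pandas as pd

labels = ['F_statistics', 'p_value', 'df_between', 'df_within', 'df_total']
df = pd.DataFrame.from_records(
    [(9.37556366342, 0.000101673198518, 2, 471, 473)], columns=labels
)

rows = []  # hypothetical container, not part of the original post
for r in df.itertuples(index=True, name='summary_stats'):
    # With index=True the first field is the index, followed by the columns.
    df_row = list(r)
    rows.append(df_row)
    print(df_row)
```

Each `df_row` starts with the index, as in the original loop, and the three degrees-of-freedom values remain integers.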