I'm iterating over a DataFrame with thousands of rows. Ideally, I would like to know the progress of my loop, i.e. how many rows it has completed, what percentage of the total rows it has completed, etc.
Is there a way I can print the row number or, even better, the percentage of rows iterated over?
My code is currently as below. Printing it this way displays some kind of tuple/list, but all I need is the row number. This is probably simple.
for row in testDF.iterrows():
    print("Currently on row: "+str(row))
Ideal printed response:
Currently on row 1; Currently iterated 1% of rows
Currently on row 2; Currently iterated 2% of rows
Currently on row 3; Currently iterated 3% of rows
Currently on row 4; Currently iterated 4% of rows
Currently on row 5; Currently iterated 5% of rows
You can use len(df.index) to find the number of rows in a pandas DataFrame. For a frame with a default index, df.index returns something like RangeIndex(start=0, stop=8, step=1), and passing it to len() gives the row count.
The iterrows() method returns an iterator over the DataFrame, allowing us to loop over each row. Each iteration produces an index object and a row object (a pandas Series).
iterrows() is used to iterate over the rows of a pandas DataFrame as (index, Series) pairs. On each iteration it returns a tuple with the row's index label and the row's content as a Series. Syntax: DataFrame.iterrows() Yields: index - the index of the row.
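To make the two points above concrete, here is a minimal sketch of what len(df.index) and iterrows() give back. The frame `df`, its column name and its index labels are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical three-row frame with non-numeric index labels.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

# Number of rows; len(df) gives the same result.
n_rows = len(df.index)

# iterrows() yields (index_label, row) tuples, where row is a Series.
first_label, first_row = next(df.iterrows())

print(n_rows)                            # 3
print(first_label)                       # a
print(isinstance(first_row, pd.Series))  # True
```

This is why printing the loop variable directly shows a tuple: it holds both the label and the row.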
First of all, iterrows gives tuples of (index, row). So the proper code is
for index, row in testDF.iterrows():
In the general case the index is not the row number; it is an identifier (this is part of the power of pandas, but it causes some confusion because it does not behave like an ordinary Python list, where the index is the position). That is why we need to count the rows independently. We could introduce line_number = 0 and increase it in each cycle with line_number += 1. But Python gives us a ready-made tool for that: enumerate, which returns tuples of (line_number, value) instead of just value. So we arrive at this code:
for line_number, (index, row) in enumerate(testDF.iterrows()):
    print("Currently on row: {}; Currently iterated {}% of rows".format(
        line_number, 100*(line_number + 1)/len(testDF)))
P.S. In Python 2, dividing two integers returns an integer, which is why 999/1000 = 0 rather than what you would expect. So you can either force a float or move the 100* to the beginning to get an integer percentage.
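If the counter should start at 1, as in the ideal output from the question, enumerate accepts a start argument. The following is a small sketch assuming Python 3 (true division); the four-row `testDF` and its index labels are a hypothetical stand-in, and the messages are collected in a list only so they are easy to inspect.

```python
import pandas as pd

# Hypothetical stand-in for testDF, with a non-sequential index.
testDF = pd.DataFrame({"value": [5, 6, 7, 8]}, index=[10, 20, 30, 40])

total = len(testDF)
messages = []

# start=1 makes the counter 1-based, independent of the index labels.
for row_number, (index, row) in enumerate(testDF.iterrows(), start=1):
    percent = 100 * row_number / total
    messages.append(
        "Currently on row {}; Currently iterated {:.0f}% of rows".format(
            row_number, percent))

print("\n".join(messages))
```

The {:.0f} format rounds the percentage to a whole number; drop it to see the full float.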
One possible solution uses format, if the index is unique and monotonic (0, 1, 2, ...):
for i, row in testDF.iterrows():
    print("Currently on row: {}; Currently iterated {}% of rows".format(i, (i + 1)/len(testDF.index) * 100))
Sample:
import numpy as np
import pandas as pd

np.random.seed(1332)
testDF = pd.DataFrame(np.random.randint(10, size=(10, 3)))
print(testDF)
   0  1  2
0  8  1  9
1  4  3  5
2  0  1  3
3  1  8  6
4  7  4  7
5  7  5  3
6  7  9  9
7  0  1  2
8  1  3  4
9  0  0  3
for i, row in testDF.iterrows():
    print("Currently on row: {}; Currently iterated {}% of rows".format(i, (i + 1)/len(testDF.index) * 100))
Currently on row: 0; Currently iterated 10.0% of rows
Currently on row: 1; Currently iterated 20.0% of rows
Currently on row: 2; Currently iterated 30.0% of rows
Currently on row: 3; Currently iterated 40.0% of rows
Currently on row: 4; Currently iterated 50.0% of rows
Currently on row: 5; Currently iterated 60.0% of rows
Currently on row: 6; Currently iterated 70.0% of rows
Currently on row: 7; Currently iterated 80.0% of rows
Currently on row: 8; Currently iterated 90.0% of rows
Currently on row: 9; Currently iterated 100.0% of rows
EDIT:
If the index has custom values, one solution uses zip and numpy.arange over the length of the index, which is the same as the length of df:
np.random.seed(1332)
testDF = pd.DataFrame(np.random.randint(10, size=(10, 3)), index=[2,4,5,6,7,8,2,1,3,5])
print(testDF)
   0  1  2
2  8  1  9
4  4  3  5
5  0  1  3
6  1  8  6
7  7  4  7
8  7  5  3
2  7  9  9
1  0  1  2
3  1  3  4
5  0  0  3
for i, (idx, row) in zip(np.arange(len(testDF.index)), testDF.iterrows()):
    print("Currently on row: {}; Currently iterated {}% of rows".format(idx, (i + 1)/len(testDF.index) * 100))
Currently on row: 2; Currently iterated 10.0% of rows
Currently on row: 4; Currently iterated 20.0% of rows
Currently on row: 5; Currently iterated 30.0% of rows
Currently on row: 6; Currently iterated 40.0% of rows
Currently on row: 7; Currently iterated 50.0% of rows
Currently on row: 8; Currently iterated 60.0% of rows
Currently on row: 2; Currently iterated 70.0% of rows
Currently on row: 1; Currently iterated 80.0% of rows
Currently on row: 3; Currently iterated 90.0% of rows
Currently on row: 5; Currently iterated 100.0% of rows
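With thousands of rows, printing a line for every row can flood the console. One option, sketched below, is to wrap the iterator in a small generator that reports only every N rows and on the last row. The helper name `with_progress` and its `every` and `report` parameters are hypothetical and not part of pandas; the list `fake_rows` is only a stand-in for testDF.iterrows().

```python
def with_progress(pairs, total, every=100, report=print):
    """Yield items from `pairs` unchanged, reporting progress every `every` items.

    `with_progress` is a hypothetical helper, not a pandas function.
    """
    for count, item in enumerate(pairs, start=1):
        # Report on every `every`-th item and once more on the final item.
        if count % every == 0 or count == total:
            report("Processed {} of {} rows ({:.0f}%)".format(
                count, total, 100 * count / total))
        yield item


# Stand-in for testDF.iterrows(): any iterable of (index, row) pairs works.
fake_rows = [(label, {"value": label * 2}) for label in range(250)]

seen = []
for index, row in with_progress(fake_rows, total=len(fake_rows), every=100):
    seen.append(index)
```

With a real DataFrame the loop would read: for index, row in with_progress(testDF.iterrows(), total=len(testDF)).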