
Pandas Iterrows Row Number & Percentage

Tags: python, pandas

I'm iterating over a DataFrame with thousands of rows. Ideally, I would like to know the progress of my loop, i.e. how many rows it has completed, what percentage of the total rows it has completed, etc.

Is there a way I can print the row number or even better, the percentage of rows iterated over?

My code is shown below. Printing the loop variable this way displays some kind of tuple/list, but all I need is the row number. This is probably simple.

for row in testDF.iterrows():
    print("Currently on row: " + str(row))

Ideal printed response:

Currently on row 1; Currently iterated 1% of rows
Currently on row 2; Currently iterated 2% of rows
Currently on row 3; Currently iterated 3% of rows
Currently on row 4; Currently iterated 4% of rows
Currently on row 5; Currently iterated 5% of rows
christaylor asked Jul 02 '17 13:07



2 Answers

First of all, iterrows yields tuples of (index, row), so the loop should unpack both:

for index, row in testDF.iterrows():

In general, the index is not the row number; it is an identifier. This is one of the strengths of pandas, but it can be confusing because it does not behave like an ordinary Python list, where the index is the position. That is why the row number has to be tracked separately. We could introduce line_number = 0 and increment it on each iteration with line_number += 1, but Python provides a ready-made tool for this: enumerate, which returns tuples of (line_number, value) instead of just value. That leads to the following code:

for line_number, (index, row) in enumerate(testDF.iterrows()):
    print("Currently on row: {}; Currently iterated {}% of rows".format(
          line_number, 100*(line_number + 1)/len(testDF)))
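
The ideal output in the question counts rows from 1, while enumerate starts at 0 by default. A minimal sketch of one way to get 1-based numbering is to pass start=1 to enumerate. The helper name progress_messages and the stand-in pairs list are illustrative assumptions, not part of the answer; the list mimics the (index, row) tuples that testDF.iterrows() yields so that the snippet runs without pandas.

```python
def progress_messages(pairs, total):
    """Yield one progress string per (index, row) pair.

    `pairs` is any iterable of (index, row) tuples, e.g. testDF.iterrows();
    `total` is the number of rows, e.g. len(testDF).
    """
    # start=1 makes the counter 1-based, so the last row reports 100%.
    for line_number, (index, row) in enumerate(pairs, start=1):
        percent = 100 * line_number / total
        yield "Currently on row {}; Currently iterated {:.0f}% of rows".format(
            line_number, percent)


# Hypothetical stand-in for testDF.iterrows(): (index label, row data) tuples.
pairs = [("a", {"x": 1}), ("b", {"x": 2}), ("c", {"x": 3}), ("d", {"x": 4})]
messages = list(progress_messages(pairs, total=len(pairs)))
for message in messages:
    print(message)
```

With a real DataFrame, the same helper could presumably be called as progress_messages(testDF.iterrows(), len(testDF)).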

P.S. Python 2 returns an integer when you divide integers, so 999/1000 = 0, which is probably not what you expect. You can either force a float or move the 100* to the front to get an integer percentage.
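
The division pitfall described in the P.S. can be reproduced in Python 3 with the floor-division operator //, which behaves like Python 2's / on integers. This is a small sketch; the variable names are chosen only for illustration.

```python
done, total = 999, 1000

# Floor division first, then scaling: the fraction is truncated to 0.
truncated = done // total * 100

# Scaling first keeps a meaningful integer percentage.
integer_percent = 100 * done // total

# True division (the default `/` in Python 3) keeps the fraction.
float_percent = done / total * 100

print(truncated, integer_percent, float_percent)
```

In Python 2, adding from __future__ import division at the top of the module makes / perform true division, as it does in Python 3.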

Leonid Mednikov answered Sep 19 '22 23:09


One possible solution uses format, assuming a unique, monotonic default index (0, 1, 2, ...):

for i, row in testDF.iterrows():
    print("Currently on row: {}; Currently iterated {}% of rows".format(i, (i + 1)/len(testDF.index) * 100))

Sample:

import numpy as np
import pandas as pd

np.random.seed(1332)
testDF = pd.DataFrame(np.random.randint(10, size=(10, 3)))
print(testDF)
   0  1  2
0  8  1  9
1  4  3  5
2  0  1  3
3  1  8  6
4  7  4  7
5  7  5  3
6  7  9  9
7  0  1  2
8  1  3  4
9  0  0  3

for i, row in testDF.iterrows():
    print("Currently on row: {}; Currently iterated {}% of rows".format(i, (i + 1)/len(testDF.index) * 100))
Currently on row: 0; Currently iterated 10.0% of rows
Currently on row: 1; Currently iterated 20.0% of rows
Currently on row: 2; Currently iterated 30.0% of rows
Currently on row: 3; Currently iterated 40.0% of rows
Currently on row: 4; Currently iterated 50.0% of rows
Currently on row: 5; Currently iterated 60.0% of rows
Currently on row: 6; Currently iterated 70.0% of rows
Currently on row: 7; Currently iterated 80.0% of rows
Currently on row: 8; Currently iterated 90.0% of rows
Currently on row: 9; Currently iterated 100.0% of rows

EDIT:

If the index has custom values, one solution is to zip the rows with numpy.arange over the length of the index, which is the same as the length of the DataFrame:

np.random.seed(1332)
testDF = pd.DataFrame(np.random.randint(10, size=(10, 3)), index=[2,4,5,6,7,8,2,1,3,5])
print(testDF)
   0  1  2
2  8  1  9
4  4  3  5
5  0  1  3
6  1  8  6
7  7  4  7
8  7  5  3
2  7  9  9
1  0  1  2
3  1  3  4
5  0  0  3

for i, (idx, row) in zip(np.arange(len(testDF.index)), testDF.iterrows()):
    print("Currently on row: {}; Currently iterated {}% of rows".format(idx, (i + 1)/len(testDF.index) * 100))

Currently on row: 2; Currently iterated 10.0% of rows
Currently on row: 4; Currently iterated 20.0% of rows
Currently on row: 5; Currently iterated 30.0% of rows
Currently on row: 6; Currently iterated 40.0% of rows
Currently on row: 7; Currently iterated 50.0% of rows
Currently on row: 8; Currently iterated 60.0% of rows
Currently on row: 2; Currently iterated 70.0% of rows
Currently on row: 1; Currently iterated 80.0% of rows
Currently on row: 3; Currently iterated 90.0% of rows
Currently on row: 5; Currently iterated 100.0% of rows
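
As a possible simplification of the loop above, the built-in enumerate yields the same positional counter as zipping with numpy.arange. The sketch below uses a plain list of (label, row) tuples as a stand-in for testDF.iterrows() so that it runs without pandas; the labels match the custom index in the sample, while the row values are placeholders and the names labels, rows and lines are assumptions made for illustration.

```python
# Index labels from the sample above; the duplicates (2 and 5) are intentional.
labels = [2, 4, 5, 6, 7, 8, 2, 1, 3, 5]
# Stand-in for testDF.iterrows(): (label, row) tuples with placeholder rows.
rows = [(label, None) for label in labels]

lines = []
for i, (idx, row) in enumerate(rows):
    # i is the 0-based position; idx is the (possibly repeated) index label.
    percent = 100 * (i + 1) / len(rows)
    lines.append(
        "Currently on row: {}; Currently iterated {:.1f}% of rows".format(idx, percent))

print("\n".join(lines))
```

Because the percentage is computed from the position i rather than from the label idx, repeated or unordered labels do not affect the reported progress.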
jezrael answered Sep 19 '22 23:09