
pandas dataframe: len(df) is not equal to number of iterations in df.iterrows()

I have a dataframe where I want to print each row to a different file. When the dataframe consists of e.g. only 50 rows, len(df) will print 50 and iterating over the rows of the dataframe like

for index, row in df.iterrows():
    print(index)

will print the index from 0 to 49.

However, if my dataframe contains more than 50'000 rows, len(df) and the number of iterations over df.iterrows() appear to differ significantly. For example, len(df) reports 50'554, while the printed index goes up to over 400'000.

How can this be? What am I missing here?

dliv asked Sep 06 '16 12:09



1 Answer

First, as @EdChum noted in the comments, your question's title refers to iterrows, but the example you give uses iteritems, which iterates over columns rather than rows, so its iteration count is unrelated to len(df). I assume you meant iterrows (as in the title).

Note that a DataFrame's index need not be a running index, irrespective of the size of the DataFrame. For example:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[2, 4, 5, 1000])
>>> for index, row in df.iterrows():
...     print(index)
...
2
4
5
1000

Presumably, then, your long DataFrame was either created differently or underwent some manipulation (filtering out rows, for instance) that affected the index, so its labels no longer run from 0 to len(df) - 1.
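As an illustration of how this can happen, here is a minimal sketch. The data and the filter condition are made up for the example and are not taken from your DataFrame; the point is only that row filtering keeps the original index labels, so the largest label can far exceed len(df). If contiguous labels are what you want, reset_index(drop=True) is one way to renumber the rows.

```python
import pandas as pd

# Hypothetical data: 10 rows with the default index 0..9.
df = pd.DataFrame({'a': range(10)})

# Boolean filtering keeps the original index labels of the surviving rows.
filtered = df[df['a'] % 4 == 0]

print(len(filtered))          # number of rows that survived the filter
print(list(filtered.index))   # labels of those rows; no longer contiguous

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))
```

In both cases the number of iterations of iterrows() still equals len(df); only the labels being printed differ.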

If you really must iterate with a running index, you can use Python's enumerate:

>>> for index, row in enumerate(df.iterrows()):
...     print(index)
...
0
1
2
3

(Note that, in this case, row is itself a tuple: the original index label followed by the row's Series.)
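If you need both the running position and the row's values, one option is to unpack the tuple directly in the loop header. This is just a sketch using the small example DataFrame from above; the names position, label and collected are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[2, 4, 5, 1000])

collected = []
# enumerate yields (position, item); each item from iterrows() is (label, Series).
for position, (label, row) in enumerate(df.iterrows()):
    collected.append((position, label, row['a']))
    print(position, label, row['a'])
```

Here position counts 0, 1, 2, 3 regardless of the index, while label carries the original index values 2, 4, 5, 1000.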

Ami Tavory answered Sep 29 '22 17:09
