
Using pandas .append within for loop

I am appending rows to a pandas DataFrame within a for loop, but at the end the DataFrame is always empty. I don't want to add the rows to an array and then call the DataFrame constructor, because my actual for loop handles lots of data. I also tried pd.concat without success. Could anyone point out what I am missing to make the append statement work? Here's a dummy example:

    import pandas as pd
    import numpy as np

    data = pd.DataFrame([])

    for i in np.arange(0, 4):
        if i % 2 == 0:
            data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
        else:
            data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)

    print data.head()

    Empty DataFrame
    Columns: []
    Index: []
    [Finished in 0.676s]
calpyte asked May 03 '16 16:05





1 Answer

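append does not modify the DataFrame in place: every call returns a new DataFrame containing a copy of the original plus your new row. Your loop throws that return value away, which is why data is still empty at the end; writing data = data.append(...) would make it work. However, because each call copies everything accumulated so far, the loop as a whole does O(N^2) work (sometimes called quadratic copying), and it will quickly become very slow, especially since you have lots of data.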
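To make the quadratic cost concrete, here is a rough pure-Python sketch (no pandas involved) that just counts row copies. The function names are made up for illustration, and it assumes every append rewrites all existing rows plus the new one, which is a simplification of what pandas does internally.

```python
def rows_copied_by_repeated_append(n_rows):
    """Count row copies if every append rebuilds the whole table."""
    copied = 0
    size = 0
    for _ in range(n_rows):
        size += 1        # the fresh table holds the old rows plus one new row
        copied += size   # every one of them is written into the new copy
    return copied


def rows_copied_by_build_once(n_rows):
    """Count row copies if rows are collected first and the table is built once."""
    return n_rows


n = 10000
print(rows_copied_by_repeated_append(n))  # 50005000, i.e. n * (n + 1) / 2
print(rows_copied_by_build_once(n))       # 10000
```

Under that assumption, 10,000 appends move about 50 million rows in total, versus 10,000 when the table is built once at the end.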

Appending to a plain Python list is cheap, so in your case I would recommend collecting the values in lists and then calling the DataFrame constructor once at the end.

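    a_list = []
    b_list = []
    # my_data and process_data stand in for your own data source and per-item logic
    for data in my_data:
        a, b = process_data(data)
        a_list.append(a)
        b_list.append(b)
    # build the DataFrame once, after the loop
    df = pd.DataFrame({'A': a_list, 'B': b_list})
    del a_list, b_list  # free the temporary lists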
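Applied to the dummy loop from the question, the same idea might look like the sketch below. It uses a single list of dicts instead of one list per column (my variation, not part of the snippet above), so a row that has no 'B' value simply omits the key and pandas fills in NaN.

```python
import pandas as pd

rows = []  # one plain dict per row; appending to a list is cheap
for i in range(4):
    if i % 2 == 0:
        rows.append({'A': i, 'B': i + 1})
    else:
        rows.append({'A': i})  # no 'B' here, so it becomes NaN in the result

# Build the DataFrame once, after the loop.
data = pd.DataFrame(rows)
print(data.head())
```

The DataFrame is only constructed once, so no intermediate copies are made inside the loop.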

Timings

    %%timeit
    data = pd.DataFrame([])
    for i in np.arange(0, 10000):
        if i % 2 == 0:
            data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
        else:
            data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)

    1 loops, best of 3: 6.8 s per loop

    %%timeit
    a_list = []
    b_list = []
    for i in np.arange(0, 10000):
        if i % 2 == 0:
            a_list.append(i)
            b_list.append(i + 1)
        else:
            a_list.append(i)
            b_list.append(None)
    data = pd.DataFrame({'A': a_list, 'B': b_list})

    100 loops, best of 3: 8.54 ms per loop
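If each iteration really does produce a small DataFrame (rather than a few scalars), the same collect-first idea should still apply: keep the frames in a list and concatenate once after the loop. This is a sketch on the question's dummy data, not a re-run of the timings above. It also sidesteps the fact that DataFrame.append was removed in pandas 2.0, where pd.concat is the supported call.

```python
import pandas as pd

frames = []  # collect the per-iteration DataFrames here
for i in range(4):
    if i % 2 == 0:
        frames.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]))
    else:
        frames.append(pd.DataFrame({'A': i}, index=[0]))

# One concatenation at the end instead of one copy per iteration.
data = pd.concat(frames, ignore_index=True)
print(data.shape)
```

Note that pd.concat, like append, returns a new object, so its result has to be assigned to a name.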
Alexander answered Sep 17 '22 22:09
