
How to append rows in a pandas dataframe in a for loop?

I have the following for loop:

    for i in links:
        data = urllib2.urlopen(str(i)).read()
        data = json.loads(data)
        data = pd.DataFrame(data.items())
        data = data.transpose()
        data.columns = data.iloc[0]
        data = data.drop(data.index[[0]])

Each dataframe created this way has most columns in common with the others, but not all of them. Moreover, each has just one row. What I need to do is build a single dataframe that contains all the distinct columns and the row from each dataframe produced by the for loop.

I tried pandas concatenate or similar but nothing seemed to work. Any idea? Thanks.

Blue Moon asked Jul 28 '15 11:07




1 Answer

Suppose your data looks like this:

    import pandas as pd
    import numpy as np

    np.random.seed(2015)
    df = pd.DataFrame([])
    for i in range(5):
        data = dict(zip(np.random.choice(10, replace=False, size=5),
                        np.random.randint(10, size=5)))
        data = pd.DataFrame(data.items())
        data = data.transpose()
        data.columns = data.iloc[0]
        data = data.drop(data.index[[0]])
        # Note: DataFrame.append was removed in pandas 2.0;
        # on newer versions the equivalent is df = pd.concat([df, data]).
        df = df.append(data)
    print('{}\n'.format(df))
    # 0   0   1   2   3   4   5   6   7   8   9
    # 1   6 NaN NaN   8   5 NaN NaN   7   0 NaN
    # 1 NaN   9   6 NaN   2 NaN   1 NaN NaN   2
    # 1 NaN   2   2   1   2 NaN   1 NaN NaN NaN
    # 1   6 NaN   6 NaN   4   4   0 NaN NaN NaN
    # 1 NaN   9 NaN   9 NaN   7   1   9 NaN NaN

Then it could be replaced with

    np.random.seed(2015)
    data = []
    for i in range(5):
        data.append(dict(zip(np.random.choice(10, replace=False, size=5),
                             np.random.randint(10, size=5))))
    df = pd.DataFrame(data)
    print(df)

In other words, do not form a new DataFrame for each row. Instead, collect all the data in a list of dicts, and then call df = pd.DataFrame(data) once at the end, outside the loop.
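Applied to the loop in the question, the same idea might look roughly like the sketch below. The `responses` list is a hypothetical stand-in for the JSON bodies fetched from each link (the real code would read them from the network), and it assumes each body decodes to a flat dict.

```python
import json

import pandas as pd

# Hypothetical stand-ins for the JSON text returned by each link.
responses = [
    '{"a": 1, "b": 2}',
    '{"b": 3, "c": 4}',
    '{"a": 5, "c": 6, "d": 7}',
]

rows = []
for body in responses:
    # Collect one plain dict per future row; no DataFrame is built here.
    rows.append(json.loads(body))

# Build the DataFrame once, outside the loop. Keys that are missing
# from a given record show up as NaN in that row.
df = pd.DataFrame(rows)
print(df)
```

The result has one row per record and the union of all keys as columns, which is what the question asks for.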

Each call to df.append requires allocating space for a new DataFrame with one extra row, copying all the data from the original DataFrame into the new DataFrame, and then copying data into the new row. All that allocation and copying makes calling df.append in a loop very inefficient: the time cost of copying grows quadratically with the number of rows. The call-DataFrame-once code is not only easier to write, it also performs much better, because its copying cost grows only linearly with the number of rows.
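To make the quadratic-versus-linear claim concrete, here is a deliberately simplified cost model, not a benchmark. It only counts rows copied, under the assumption described above that every append copies all existing rows plus the new one; the function names are made up for illustration.

```python
def rows_copied_appending_in_loop(n_rows: int) -> int:
    """Rows copied when n_rows one-row appends each rebuild the frame."""
    # Before each append the frame already holds `existing` rows;
    # all of them are copied, plus the one new row.
    return sum(existing + 1 for existing in range(n_rows))  # n(n+1)/2


def rows_copied_building_once(n_rows: int) -> int:
    """Rows copied when the frame is built a single time from a list."""
    return n_rows


for n in (10, 100, 1000):
    print(n, rows_copied_appending_in_loop(n), rows_copied_building_once(n))
```

Under this model, 1000 rows cost 500500 row copies when appended one at a time, versus 1000 when the DataFrame is built once.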

unutbu answered Oct 07 '22 03:10