I have the following for loop:
for i in links:
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    data = pd.DataFrame(data.items())
    data = data.transpose()
    data.columns = data.iloc[0]
    data = data.drop(data.index[[0]])
Each DataFrame created this way shares most of its columns with the others, but not all of them. Moreover, each has just one row. What I need to do is build a single DataFrame that contains all the distinct columns and the row from each DataFrame produced by the for loop.
I tried pandas concatenate or similar but nothing seemed to work. Any idea? Thanks.
It turns out pandas does have an effective way to append to a DataFrame: df.loc[len(df)] = [new, row, of, data] will "append" a row to the end of a DataFrame in place.
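A minimal sketch of that in-place append, assuming the DataFrame has the default integer index (0, 1, 2, ...), so that len(df) is a label not yet in use. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame with the default RangeIndex.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# len(df) is 2, which is not an existing label, so .loc enlarges the frame.
# The list must have one value per column, in column order.
df.loc[len(df)] = [5, 6]

print(df)
```

If the index is not 0..n-1 (for example after dropping rows), len(df) may already exist as a label, and the assignment would overwrite that row instead of adding one.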
To add multiple rows to a pandas DataFrame, we can also pass a list of Series to DataFrame.append(). For example, we can create a list of Series whose index labels match the DataFrame's column names, then pass this list of Series to the append() function.
The append() function appends the rows of another DataFrame to the end of the given DataFrame and returns a new DataFrame object. Columns not present in the original DataFrame are added as new columns, and the new cells are filled with NaN. ignore_index: if True, do not use the index labels.
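DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same column-union behaviour is available through pd.concat. The sketch below uses two hypothetical one-row frames whose columns only partly overlap.

```python
import pandas as pd

# Hypothetical one-row frames that share only column "b".
first = pd.DataFrame([{"a": 1, "b": 2}])
second = pd.DataFrame([{"b": 3, "c": 4}])

# The default join="outer" keeps the union of the columns;
# cells that an input frame does not provide become NaN.
# ignore_index=True discards the input labels and renumbers rows 0..n-1.
combined = pd.concat([first, second], ignore_index=True)

print(combined)
```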
Suppose your data looks like this:
import pandas as pd
import numpy as np

np.random.seed(2015)
df = pd.DataFrame([])
for i in range(5):
    data = dict(zip(np.random.choice(10, replace=False, size=5),
                    np.random.randint(10, size=5)))
    data = pd.DataFrame(data.items())
    data = data.transpose()
    data.columns = data.iloc[0]
    data = data.drop(data.index[[0]])
    df = df.append(data)
print('{}\n'.format(df))
# 0    0    1    2    3    4    5    6    7    8    9
# 1    6  NaN  NaN    8    5  NaN  NaN    7    0  NaN
# 1  NaN    9    6  NaN    2  NaN    1  NaN  NaN    2
# 1  NaN    2    2    1    2  NaN    1  NaN  NaN  NaN
# 1    6  NaN    6  NaN    4    4    0  NaN  NaN  NaN
# 1  NaN    9  NaN    9  NaN    7    1    9  NaN  NaN
Then it could be replaced with
np.random.seed(2015)
data = []
for i in range(5):
    data.append(dict(zip(np.random.choice(10, replace=False, size=5),
                         np.random.randint(10, size=5))))
df = pd.DataFrame(data)
print(df)
In other words, do not form a new DataFrame for each row. Instead, collect all the data in a list of dicts, and then call df = pd.DataFrame(data) once at the end, outside the loop.
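Applied to the loop from the question, the idea might look like the sketch below. The JSON strings are hypothetical stand-ins for the bodies returned by each link, and the sketch assumes every response is a flat JSON object (one level of keys).

```python
import json

import pandas as pd

# Hypothetical payloads standing in for the response body of each link.
responses = [
    '{"id": 1, "name": "a", "price": 10}',
    '{"id": 2, "name": "b", "color": "red"}',
    '{"id": 3, "price": 30, "color": "blue"}',
]

rows = []
for body in responses:
    # json.loads returns a dict here; keep it as one row.
    rows.append(json.loads(body))

# One constructor call: the columns are the union of all keys,
# and keys missing from a record show up as NaN in that row.
df = pd.DataFrame(rows)

print(df)
```

In the real loop, the body would come from the HTTP call in the question instead of the hard-coded list.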
Each call to df.append requires allocating space for a new DataFrame with one extra row, copying all the data from the original DataFrame into the new DataFrame, and then copying data into the new row. All that allocation and copying makes calling df.append in a loop very inefficient. The time cost of copying grows quadratically with the number of rows. Not only is the call-DataFrame-once code easier to write, its performance will be much better: the time cost of copying grows linearly with the number of rows.
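A rough way to see the difference is to time both approaches on the same records. This is only a sketch: the function names and the 500 synthetic records are made up for illustration, pd.concat stands in for df.append (which is unavailable in pandas 2.0 and later), and the timings will vary by machine.

```python
import time

import pandas as pd


def grow_in_loop(records):
    """Grow the frame one row at a time (assumes records is non-empty)."""
    df = pd.DataFrame([records[0]])
    for record in records[1:]:
        # Each concat allocates a new frame and copies all earlier rows.
        df = pd.concat([df, pd.DataFrame([record])], ignore_index=True)
    return df


def build_once(records):
    """Build the frame with a single constructor call."""
    return pd.DataFrame(records)


# Synthetic records, purely for illustration.
records = [{"x": i, "y": i * 2} for i in range(500)]

for func in (grow_in_loop, build_once):
    start = time.perf_counter()
    result = func(records)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {elapsed:.4f} s for {len(result)} rows")
```

Both functions produce the same DataFrame; only the amount of copying differs.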