My data
frame contains 10,000,000 rows! After group by, ~ 9,000,000 sub-frames remain to loop through.
The code is:
data = read.csv('big.csv')
for id, new_df in data.groupby(level=0): # look at mini df and do some analysis
# some code for each of the small data frames
This is super inefficient, and the code has been running for 10+ hours now.
Is there a way to speed it up?
Full Code:
d = pd.DataFrame() # new df to populate
print 'Start of the loop'
for id, new_df in data.groupby(level=0):
c = [new_df.iloc[i:] for i in range(len(new_df.index))]
x = pd.concat(c, keys=new_df.index).reset_index(level=(2,3), drop=True).reset_index()
x = x.set_index(['level_0','level_1', x.groupby(['level_0','level_1']).cumcount()])
d = pd.concat([d, x])
To get the data:
data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])
Note:
Most of id's will only have 1 date. This indicates only 1 visit. For id's with more visits, I would like to structure them in a 3d format e.g. store all of their visits in the 2nd dimension out of 3. The output is (id, visits, features)
You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.
As mentioned previously, this is because apply is optimized for looping through dataframe rows much quicker than iterrows does. While slower than apply , itertuples is quicker than iterrows , so if looping is required, try implementing itertuples instead. Using map as a vectorized solution gives even faster results.
Performance Benchmark Also, the execution speed varies significantly between the fastest contestant and the looser while loop: for-each loops are more than six times faster than while loops. Even the for-range loop is nearly two times faster than the while loop.
Here is one way to speed that up. This adds the desired new rows in some code which processes the rows directly. This saves the overhead of constantly constructing small dataframes. Your sample of 100,000 rows runs in a couple of seconds on my machine. While your code with only 10,000 rows of your sample data takes > 100 seconds. This seems to represent a couple of orders of magnitude improvement.
def make_3d(csv_filename):
def make_3d_lines(a_df):
a_df['depth'] = 0
depth = 0
prev = None
accum = []
for row in a_df.values.tolist():
row[0] = 0
key = row[1]
if key == prev:
depth += 1
accum.append(row)
else:
if depth == 0:
yield row
else:
depth = 0
to_emit = []
for i in range(len(accum)):
date = accum[i][2]
for j, r in enumerate(accum[i:]):
to_emit.append(list(r))
to_emit[-1][0] = j
to_emit[-1][2] = date
for r in to_emit[1:]:
yield r
accum = [row]
prev = key
df_data = pd.read_csv('big-data.csv')
df_data.columns = ['depth'] + list(df_data.columns)[1:]
new_df = pd.DataFrame(
make_3d_lines(df_data.sort_values('id date'.split())),
columns=df_data.columns
).astype(dtype=df_data.dtypes.to_dict())
return new_df.set_index('id date'.split())
start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)
df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
print(df[df['depth'] != 0].head(10))
1.7390995025634766
depth feature0 feature1 feature2
id date
207555809644681 20180104 1 0.03125 0.038623 0.008130
247833985674646 20180106 1 0.03125 0.004378 0.004065
252945024181083 20180107 1 0.03125 0.062836 0.065041
20180107 2 0.00000 0.001870 0.008130
20180109 1 0.00000 0.001870 0.008130
329567241731951 20180117 1 0.00000 0.041952 0.004065
20180117 2 0.03125 0.003101 0.004065
20180117 3 0.00000 0.030780 0.004065
20180118 1 0.03125 0.003101 0.004065
20180118 2 0.00000 0.030780 0.004065
I believe your approach for feature engineering could be done better, but I will stick to answering your question.
In Python, iterating over a Dictionary is way faster than iterating over a DataFrame
Here how I managed to process a huge pandas DataFrame (~100,000,000 rows):
# reset the Dataframe index to get level 0 back as a column in your dataset
df = data.reset_index() # the index will be (id, date)
# split the DataFrame based on id
# and store the splits as Dataframes in a dictionary using id as key
d = dict(tuple(df.groupby('id')))
# iterate over the Dictionary and process the values
for key, value in d.items():
pass # each value is a Dataframe
# concat the values and get the original (processed) Dataframe back
df2 = pd.concat(d.values(), ignore_index=True)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With