I have a generator that returns an unknown number of rows of data that I want to convert to an indexed pandas dataframe. The fastest way I know of is to write a CSV to disk and then parse it back in via read_csv. I'm aware that it is not efficient to create an empty dataframe and constantly append new rows, and I can't create a pre-sized dataframe because I do not know how many rows will be returned. Is there a way to convert the iterator output to a pandas dataframe without writing to disk?
Vectorization is always the first and best choice. You can convert the data frame to a NumPy array or to a dictionary to speed up the iteration workflow. Iterating over the key-value pairs of a dictionary came out fastest, with roughly a 280x speedup for 20 million records.
The reason iterrows() is slower than itertuples() is that iterrows() performs many type checks over the lifetime of its call.
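To see the gap yourself, here is a minimal timing sketch; the frame size and columns are my own assumptions for illustration, not the 20-million-record benchmark quoted above:

import time
import numpy as np
import pandas as pd

# A sample frame; 100,000 rows is enough to show the ordering.
df = pd.DataFrame({"a": np.random.rand(100_000), "b": np.random.rand(100_000)})

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# iterrows() constructs a Series and runs type checks for every row -> slowest.
time_it("iterrows", lambda: [row["a"] + row["b"] for _, row in df.iterrows()])

# itertuples() yields lightweight namedtuples -> much faster.
time_it("itertuples", lambda: [t.a + t.b for t in df.itertuples(index=False)])

# Iterating plain dicts skips the pandas row machinery entirely.
records = df.to_dict("records")
time_it("dict records", lambda: [r["a"] + r["b"] for r in records])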
Which lookup wins depends on the frame size: on a large DataFrame the query function seems more efficient than the loc function, while on a small one (DF2: 2K records x 6 columns) the loc function seems much more efficient than the query function.
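A quick way to check this trade-off on your own data is to build one large and one small frame and compare both lookups with %timeit; the sizes below are my assumptions mirroring the comparison above:

import numpy as np
import pandas as pd

large = pd.DataFrame(np.random.rand(2_000_000, 6), columns=list("abcdef"))
small = pd.DataFrame(np.random.rand(2_000, 6), columns=list("abcdef"))  # the "2K x 6" case

# In IPython/Jupyter, compare each pair:
# %timeit large.query("a > 0.5")       # string expression; fixed parsing overhead
# %timeit large.loc[large["a"] > 0.5]
# %timeit small.query("a > 0.5")       # the parsing overhead dominates on a tiny frame
# %timeit small.loc[small["a"] > 0.5]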
Pandas DataFrame accepts an iterator as the data source in the constructor. You can dynamically generate rows and feed them to the data frame as you read and transform the source data.
This is most easily done by writing a generator function that uses yield to feed the results.
After the data frame has been generated, you can use set_index to choose any column as the index.
Here is an example:
def create_timeline(self) -> pd.DataFrame:
    """Create a timeline feed of how we traded over the course of time.

    Note: We assume each position has only one enter and one exit event,
    no position increases over its lifetime.

    :return: DataFrame with timestamp and timeline_event columns
    """

    # https://stackoverflow.com/questions/42999332/fastest-way-to-convert-python-iterator-output-to-pandas-dataframe
    def gen_events():
        """Generate rows for the dataframe.

        Use a Python generator to dynamically fill the Pandas dataframe.
        Each row gets timestamp and timeline_event columns.
        """
        for pair_id, history in self.asset_histories.items():
            for position in history.positions:
                open_event = TimelineEvent(
                    pair_id=pair_id,
                    position=position,
                    type=TimelineEventType.open,
                )
                yield (position.opened_at, open_event)

                # If the position is closed, generate a close event as well
                if position.is_closed():
                    close_event = TimelineEvent(
                        pair_id=pair_id,
                        position=position,
                        type=TimelineEventType.close,
                    )
                    yield (position.closed_at, close_event)

    df = pd.DataFrame(gen_events(), columns=["timestamp", "timeline_event"])
    df = df.set_index("timestamp")
    return df
The full open source example can be found here.
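If you want to try the pattern without the project-specific classes above (TimelineEvent and friends), here is a minimal self-contained sketch with made-up timestamps and payloads:

import datetime
import pandas as pd

def gen_rows():
    # Stand-in for a real data source of unknown length.
    start = datetime.datetime(2024, 1, 1)
    for i in range(5):
        yield (start + datetime.timedelta(days=i), f"event-{i}")

df = pd.DataFrame(gen_rows(), columns=["timestamp", "timeline_event"])
df = df.set_index("timestamp")
print(df)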
Iteratively appending to a pandas data frame is not the best solution. It is better to build your data as a list, and then pass it to pd.DataFrame.
import random
import pandas as pd
alpha = list('abcdefghijklmnopqrstuvwxyz')
Here we create a generator, use it to construct a list, then pass it to the dataframe constructor:
%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
my_data = [x for x in gen]
df = pd.DataFrame(my_data, columns=['letter','value'])
# result: 1 loop, best of 3: 373 ms per loop
This is quite a bit faster than creating a generator, constructing an empty dataframe, and appending rows one at a time, as seen here:
%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
df = pd.DataFrame(columns=['letter','value'])
for tup in gen:
    df.loc[df.shape[0], :] = tup
# result: 1 loop, best of 3: 13.6 s per loop
This is incredibly slow: about 13.6 seconds to construct 10,000 rows.
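As the first answer notes, modern pandas also lets you pass the generator straight to the constructor, skipping the intermediate list; pandas materializes the iterator internally, so I would expect timings comparable to the list version rather than a further speedup:

%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
df = pd.DataFrame(gen, columns=['letter','value'])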