I have a generator that returns an unknown number of rows of data that I want to convert to an indexed pandas dataframe. The fastest way I know of is to write a CSV to disk and then parse it back in via read_csv. I'm aware that it is not efficient to create an empty dataframe and constantly append new rows, and I can't create a pre-sized dataframe because I do not know how many rows will be returned. Is there a way to convert the iterator output to a pandas dataframe without writing to disk?
Vectorization is always the first and best choice. You can convert the data frame to a NumPy array or to a dictionary to speed up the iteration workflow. Iterating over the key-value pairs of a dictionary came out fastest, with roughly a 280x speedup for 20 million records.
The reason iterrows() is slower than itertuples() is that iterrows() performs many type checks over the lifetime of its call.
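To see the gap yourself, here is a minimal timing sketch; the frame size and columns are my own assumptions for illustration, not the 20-million-record benchmark quoted above:

import time
import numpy as np
import pandas as pd

# A sample frame; 100,000 rows is enough to show the ordering.
df = pd.DataFrame({"a": np.random.rand(100_000), "b": np.random.rand(100_000)})

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# iterrows() constructs a Series and runs type checks for every row -> slowest.
time_it("iterrows", lambda: [row["a"] + row["b"] for _, row in df.iterrows()])

# itertuples() yields lightweight namedtuples -> much faster.
time_it("itertuples", lambda: [t.a + t.b for t in df.itertuples(index=False)])

# Iterating plain dicts skips the pandas row machinery entirely.
records = df.to_dict("records")
time_it("dict records", lambda: [r["a"] + r["b"] for r in records])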
Which lookup wins depends on the frame size: on a large DataFrame the query function seems more efficient than the loc function, while on a small one (DF2: 2K records x 6 columns) the loc function seems much more efficient than the query function.
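A quick way to check this trade-off on your own data is to build one large and one small frame and compare both lookups with %timeit; the sizes below are my assumptions mirroring the comparison above:

import numpy as np
import pandas as pd

large = pd.DataFrame(np.random.rand(2_000_000, 6), columns=list("abcdef"))
small = pd.DataFrame(np.random.rand(2_000, 6), columns=list("abcdef"))  # the "2K x 6" case

# In IPython/Jupyter, compare each pair:
# %timeit large.query("a > 0.5")       # string expression; fixed parsing overhead
# %timeit large.loc[large["a"] > 0.5]
# %timeit small.query("a > 0.5")       # the parsing overhead dominates on a tiny frame
# %timeit small.loc[small["a"] > 0.5]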
Pandas DataFrame accepts an iterator as the data source in the constructor. You can dynamically generate rows and feed them to the data frame as you read and transform the source data.
This is most easily done by writing a generator function that uses yield to feed the results.
After the data frame has been generated, you can use set_index to choose any column as the index.
Here is an example:
def create_timeline(self) -> pd.DataFrame:
    """Create a timeline feed of how we traded over the course of time.

    Note: We assume each position has only one enter and one exit event,
    no position increases over its lifetime.

    :return: DataFrame with timestamp and timeline_event columns
    """

    # https://stackoverflow.com/questions/42999332/fastest-way-to-convert-python-iterator-output-to-pandas-dataframe
    def gen_events():
        """Generate rows for the dataframe.

        Use a Python generator to dynamically fill the Pandas dataframe.
        Each row gets timestamp and timeline_event columns.
        """
        for pair_id, history in self.asset_histories.items():
            for position in history.positions:
                open_event = TimelineEvent(
                    pair_id=pair_id,
                    position=position,
                    type=TimelineEventType.open,
                )
                yield (position.opened_at, open_event)

                # If the position is closed, generate a close event as well
                if position.is_closed():
                    close_event = TimelineEvent(
                        pair_id=pair_id,
                        position=position,
                        type=TimelineEventType.close,
                    )
                    yield (position.closed_at, close_event)

    df = pd.DataFrame(gen_events(), columns=["timestamp", "timeline_event"])
    df = df.set_index("timestamp")
    return df
The full open source example can be found here.
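If you want to try the pattern without the project-specific classes above (TimelineEvent and friends), here is a minimal self-contained sketch with made-up timestamps and payloads:

import datetime
import pandas as pd

def gen_rows():
    # Stand-in for a real data source of unknown length.
    start = datetime.datetime(2024, 1, 1)
    for i in range(5):
        yield (start + datetime.timedelta(days=i), f"event-{i}")

df = pd.DataFrame(gen_rows(), columns=["timestamp", "timeline_event"])
df = df.set_index("timestamp")
print(df)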
Iteratively appending to a pandas data frame is not the best solution. It is better to build your data as a list, and then pass it to pd.DataFrame.
import random
import pandas as pd
alpha = list('abcdefghijklmnopqrstuvwxyz')
Here we create a generator, use it to construct a list, then pass it to the dataframe constructor:
%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
my_data = [x for x in gen]
df = pd.DataFrame(my_data, columns=['letter','value'])
# result: 1 loop, best of 3: 373 ms per loop
This is quite a bit faster than creating a generator, constructing an empty dataframe, and appending rows one at a time, as seen here:
%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
df = pd.DataFrame(columns=['letter','value'])
for tup in gen:
    df.loc[df.shape[0], :] = tup
# result: 1 loop, best of 3: 13.6 s per loop
This is incredibly slow: about 13.6 seconds to construct 10,000 rows.
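As the first answer notes, modern pandas also lets you pass the generator straight to the constructor, skipping the intermediate list; pandas materializes the iterator internally, so I would expect timings comparable to the list version rather than a further speedup:

%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
df = pd.DataFrame(gen, columns=['letter','value'])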