 

Python - Efficient way to add rows to dataframe

From this question and others, it seems that using concat or append to build a pandas DataFrame is not recommended, because each call recopies the whole DataFrame.

My project involves retrieving a small amount of data every 30 seconds. This might run over a 3-day weekend, so one could easily expect over 8,000 rows to be created one row at a time. What would be the most efficient way to add rows to this DataFrame?

Jarrod asked Jan 27 '17





2 Answers

I used this answer's df.loc[i] = [new_data] suggestion, but I have > 500,000 rows and that was very slow.

While the answers given are good for the OP's question, I found it more efficient, when dealing with a large number of rows up front (instead of the trickling in described by the OP), to use csv.writer to add data to an in-memory CSV object, and then finally use pandas.read_csv() on that object to generate the desired DataFrame output.

from io import StringIO
from csv import writer

import pandas as pd

def rows_to_dataframe(iterable_object):
    # stream rows into an in-memory CSV buffer, then parse it once with pandas
    output = StringIO()               # csv.writer needs a text buffer (StringIO, not BytesIO)
    csv_writer = writer(output)

    for row in iterable_object:
        csv_writer.writerow(row)

    output.seek(0)                    # we need to get back to the start of the buffer
    df = pd.read_csv(output)
    return df

This was roughly 1000x faster for ~500,000 rows, and the speed improvement only grows with the row count (df.loc[i] = [data] gets comparatively slower and slower).
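For the OP's trickle-in scenario specifically, here is a minimal sketch (not from the original answer) of how the same idea could look: stream each reading into the text buffer as it arrives, then build the DataFrame once at the end. get_reading() is a hypothetical stand-in for the real data source, and the loop length and sleep interval are shortened so the example runs quickly.

from csv import writer
from io import StringIO
from datetime import datetime
import time

import pandas as pd

def get_reading():
    # stand-in for the OP's real data source; returns one small record
    return (datetime.now().isoformat(), 42)

output = StringIO()
csv_writer = writer(output)
csv_writer.writerow(["timestamp", "value"])    # header row so read_csv names the columns

for _ in range(5):                             # in the real script this would loop for days
    csv_writer.writerow(get_reading())         # appending to the text buffer is cheap
    time.sleep(0.1)                            # the OP would sleep 30 seconds here

output.seek(0)                                 # rewind before handing the buffer to pandas
df = pd.read_csv(output)
print(df)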

Hope this helps someone who needs efficiency when dealing with more rows than the OP.

Tom Harvey answered Sep 19 '22


Editing the chosen answer here since it was completely mistaken. What follows is an explanation of why you should not use setting with enlargement. "Setting with enlargement" is actually worse than append.

The tl;dr here is that there is no efficient way to do this with a DataFrame, so if you need speed you should use another data structure instead. See other answers for better solutions.
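As one illustration of "another data structure" (a sketch, not part of the original answer): accumulate plain Python dicts in a list and construct the DataFrame once at the end, so the frame is never copied per row.

import pandas as pd

rows = []                                      # plain Python list; appending to it is cheap
for i in range(8000):
    rows.append({"A": i, "B": i, "C": i})      # one dict per incoming record

df = pd.DataFrame(rows, columns=["A", "B", "C"])   # single construction at the end

List appends are cheap, and pd.DataFrame() can build the whole frame from the dicts in one pass, so the only expensive step happens once.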

More on setting with enlargement

You can add rows to a DataFrame in-place using loc on a non-existent index, but that also performs a copy of all of the data (see this discussion). Here's how it would look, from the Pandas documentation:

In [119]: dfi
Out[119]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

In [120]: dfi.loc[3] = 5

In [121]: dfi
Out[121]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

For something like the use case described, setting with enlargement actually takes 50% longer than append:

With append(), 8000 rows took 6.59s (0.8ms per row)

%%timeit
df = pd.DataFrame(columns=["A", "B", "C"])
new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(8000):
    df = df.append(new_row, ignore_index=True)

# 6.59 s ± 53.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With .loc[], 8000 rows took 10.2 s (about 1.3 ms per row)

%%timeit
df = pd.DataFrame(columns=["A", "B", "C"])
new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(8000):
    df.loc[i] = new_row

# 10.2 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

What about a longer DataFrame?

As with all profiling of data-oriented code, YMMV and you should test this for your use case. One consequence of the fact that both append and "setting with enlargement" copy the whole DataFrame on each insert is that they get slower and slower as the DataFrame grows:

%%timeit
df = pd.DataFrame(columns=["A", "B", "C"])
new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(16000):
    df.loc[i] = new_row

# 23.7 s ± 286 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Building a 16k row DataFrame with this method takes 2.3x longer than 8k rows.
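For comparison, a sketch (not part of the original answer, and with no measured number quoted since timing is machine dependent) of the same 16,000-row build done by accumulating dicts first and constructing the DataFrame once:

%%timeit
rows = [{"A": 4, "B": 4, "C": 4} for i in range(16000)]    # accumulate plain dicts first
df = pd.DataFrame(rows, columns=["A", "B", "C"])           # build the frame once, no per-row copying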

sundance answered Sep 21 '22