I've inherited a data file saved in the Stata .dta format. I can load it with the scikits.statsmodels genfromdta() function. This puts my data into a 1-dimensional NumPy array, where each entry is a row of data stored as a 24-tuple.
In [2]: st_time = time.time(); initialload = sm.iolib.genfromdta("/home/myfile.dta"); ed_time = time.time(); print (ed_time - st_time)
666.523324013
In [3]: type(initialload)
Out[3]: numpy.ndarray
In [4]: initialload.shape
Out[4]: (4809584,)
In [5]: initialload[0]
Out[5]: (19901130.0, 289.0, 1990.0, 12.0, 19901231.0, 18.0, 40301000.0, 'GB', 18242.0, -2.368063, 1.0, 1.7783716290878204, 4379.355, 66.17669677734375, -999.0, -999.0, -0.60000002, -999.0, -999.0, -999.0, -999.0, -999.0, 0.2, 371.0)
I am curious if there's an efficient way to arrange this into a Pandas DataFrame. From what I've read, building up a DataFrame row-by-row seems quite inefficient... but what are my options?
I've written a pretty slow first pass that just wraps each tuple in a single-row DataFrame and appends it, roughly as sketched below. Just wondering if anything else is known to be better.
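For reference, the slow version is roughly along these lines (column_names here is just a placeholder for my actual list of 24 names):

import pandas as pd

# Row-by-row: wrap each tuple in a one-row DataFrame and concatenate it on.
df = pd.DataFrame(columns=column_names)
for row in initialload:
    df = pd.concat([df, pd.DataFrame([tuple(row)], columns=column_names)],
                   ignore_index=True)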
You can pass the array of tuples directly to the pandas.DataFrame() constructor: pandas stores the data in tabular form, and each tuple becomes one row of the resulting DataFrame.
pandas.DataFrame(initialload, columns=list_of_column_names)
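A minimal sketch of that idea; the toy array, field names, and values below are placeholders standing in for genfromdta()'s output, which is typically a structured array whose dtype carries the variable names:

import numpy as np
import pandas as pd

# Stand-in for genfromdta() output: a structured array with named fields.
initialload = np.array(
    [(19901130.0, 289.0, 'GB'), (19901231.0, 290.0, 'US')],
    dtype=[('date', 'f8'), ('obs_id', 'f8'), ('country', 'U2')],
)

# If the dtype already carries field names, from_records uses them as columns.
df = pd.DataFrame.from_records(initialload)

# To supply your own column names instead, convert to a list of tuples first.
df = pd.DataFrame(initialload.tolist(), columns=['date', 'obs_id', 'country'])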
Version 0.12 of pandas onwards should support loading Stata format directly (Reference).
From the documentation:
The top-level function read_stata will read a dta format file and return a DataFrame. The class StataReader will read the header of the given dta file at initialization; its data() method will then read the observations and convert them to a DataFrame, which is returned:
pd.read_stata('stata.dta')
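Applied to the file from the question, that looks roughly like this; the StataReader form mirrors the documentation quoted above, and its data() method is deprecated in newer pandas, where read() plays that role:

import pandas as pd
from pandas.io.stata import StataReader

# One call: read the whole .dta file into a DataFrame.
df = pd.read_stata('/home/myfile.dta')

# Two-step form via StataReader, per the docs quoted above.
reader = StataReader('/home/myfile.dta')
df = reader.data()   # newer pandas versions: reader.read()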