Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently construct Pandas DataFrame from large list of tuples/rows

I've inherited a data file saved in the Stata .dta format. I can load it in with scikits.statsmodels genfromdta() function. This puts my data into a 1-dimensional NumPy array, where each entry is a row of data, stored in a 24-tuple.

In [2]: st_time = time.time(); initialload = sm.iolib.genfromdta("/home/myfile.dta"); ed_time = time.time(); print (ed_time - st_time)
666.523324013

In [3]: type(initialload)
Out[3]: numpy.ndarray

In [4]: initialload.shape
Out[4]: (4809584,)

In [5]: initialload[0]
Out[5]: (19901130.0, 289.0, 1990.0, 12.0, 19901231.0, 18.0, 40301000.0, 'GB', 18242.0, -2.368063, 1.0, 1.7783716290878204, 4379.355, 66.17669677734375, -999.0, -999.0, -0.60000002, -999.0, -999.0, -999.0, -999.0, -999.0, 0.2, 371.0)

I am curious if there's an efficient way to arrange this into a Pandas DataFrame. From what I've read, building up a DataFrame row-by-row seems quite inefficient... but what are my options?

I've written a pretty slow first-pass that just reads each tuple as a single-row DataFrame and appends it. Just wondering if anything else is known to be better.

like image 843
ely Avatar asked Jul 10 '12 14:07

ely


People also ask

How do you create a DataFrame from a list of tuples?

Pandas with Python We need to send this list of tuples as a parameter to the pandas. DataFrame() function. The Pandas DataFrame object will store the data in a tabular format, Here the tuple element of the list object will become the row of the resultant DataFrame.

How do you convert a list of tuples to a Pandas DataFrame?

To convert a Python tuple to DataFrame, use the pd. DataFrame() constructor that accepts a tuple as an argument and it returns a DataFrame.

What is the most efficient way to loop through DataFrames with pandas?

Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.

Is pandas DataFrame faster than list?

Results. From the above, we can see that for summation, the DataFrame implementation is only slightly faster than the List implementation. This difference is much more pronounced for the more complicated Haversine function, where the DataFrame implementation is about 10X faster than the List implementation.


2 Answers

pandas.DataFrame(initialload, columns=list_of_column_names)
like image 167
eumiro Avatar answered Nov 11 '22 14:11

eumiro


Version 0.12 of pandas onwards should support loading Stata format directly (Reference).

From the documentation:

The top-level function read_stata will read a dta format file and return a DataFrame: The class StataReader will read the header of the given dta file at initialization. Its method data() will read the observations, converting them to a DataFrame which is returned:

 pd.read_stata('stata.dta')
like image 44
saffsd Avatar answered Nov 11 '22 14:11

saffsd