I've inherited a data file saved in the Stata .dta format. I can load it with the scikits.statsmodels genfromdta() function. This puts my data into a 1-dimensional NumPy array, where each entry is a row of data stored as a 24-tuple.
In [2]: st_time = time.time(); initialload = sm.iolib.genfromdta("/home/myfile.dta"); ed_time = time.time(); print (ed_time - st_time)
666.523324013
In [3]: type(initialload)
Out[3]: numpy.ndarray
In [4]: initialload.shape
Out[4]: (4809584,)
In [5]: initialload[0]
Out[5]: (19901130.0, 289.0, 1990.0, 12.0, 19901231.0, 18.0, 40301000.0, 'GB', 18242.0, -2.368063, 1.0, 1.7783716290878204, 4379.355, 66.17669677734375, -999.0, -999.0, -0.60000002, -999.0, -999.0, -999.0, -999.0, -999.0, 0.2, 371.0)
I am curious if there's an efficient way to arrange this into a Pandas DataFrame. From what I've read, building up a DataFrame row-by-row seems quite inefficient... but what are my options?
I've written a pretty slow first pass that just wraps each tuple in a single-row DataFrame and appends it, roughly as sketched below. Just wondering if anything else is known to be better.
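For reference, the slow version is roughly along these lines (column_names here is just a placeholder for my actual list of 24 names):

import pandas as pd

# Row-by-row: wrap each tuple in a one-row DataFrame and concatenate it on.
df = pd.DataFrame(columns=column_names)
for row in initialload:
    df = pd.concat([df, pd.DataFrame([tuple(row)], columns=column_names)],
                   ignore_index=True)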
You can pass the array of tuples directly to the pandas.DataFrame() constructor: pandas stores the data in tabular form, and each tuple becomes one row of the resulting DataFrame.
pandas.DataFrame(initialload, columns=list_of_column_names)
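A minimal sketch of that idea; the toy array, field names, and values below are placeholders standing in for genfromdta()'s output, which is typically a structured array whose dtype carries the variable names:

import numpy as np
import pandas as pd

# Stand-in for genfromdta() output: a structured array with named fields.
initialload = np.array(
    [(19901130.0, 289.0, 'GB'), (19901231.0, 290.0, 'US')],
    dtype=[('date', 'f8'), ('obs_id', 'f8'), ('country', 'U2')],
)

# If the dtype already carries field names, from_records uses them as columns.
df = pd.DataFrame.from_records(initialload)

# To supply your own column names instead, convert to a list of tuples first.
df = pd.DataFrame(initialload.tolist(), columns=['date', 'obs_id', 'country'])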
Version 0.12 of pandas onwards should support loading Stata format directly (Reference).
From the documentation:
The top-level function read_stata will read a dta format file and return a DataFrame. The class StataReader will read the header of the given dta file at initialization; its data() method will then read the observations and convert them to a DataFrame, which is returned:
pd.read_stata('stata.dta')
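Applied to the file from the question, that looks roughly like this; the StataReader form mirrors the documentation quoted above, and its data() method is deprecated in newer pandas, where read() plays that role:

import pandas as pd
from pandas.io.stata import StataReader

# One call: read the whole .dta file into a DataFrame.
df = pd.read_stata('/home/myfile.dta')

# Two-step form via StataReader, per the docs quoted above.
reader = StataReader('/home/myfile.dta')
df = reader.data()   # newer pandas versions: reader.read()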