Pandas read_stata() with large .dta files

I am working with a Stata .dta file of about 3.3 GB, so it is large but not excessively so. I want to use IPython and tried to import the file with Pandas, but something odd is going on. My machine has 32 GB of RAM, and attempting to load the file uses up all of it (after ~30 minutes) and stalls the computer. This doesn't feel right, because I can open the same file in R using read.dta() from the foreign package without any problem, and working with the file in Stata is fine. The code I am using is:

%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')

and I am running IPython in Enthought's Canopy. The '%time' is there because I want to benchmark this against R's read.dta().

My questions are:

Is there something I am doing wrong that is resulting in Pandas having issues?
Is there a workaround to get the data into a Pandas dataframe?

asked Nov 02 '13 17:11

Jonathan

1 Answer

A simpler way to handle this is to have Pandas' built-in read_stata function read the file in chunks via its chunksize argument.

Assume your large file is named large.dta.

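import pandas as pd

# Read the file 100,000 rows at a time instead of all at once.
reader = pd.read_stata("large.dta", chunksize=100000)

# Collect the chunks and concatenate them once.
# (DataFrame.append was removed in pandas 2.0 and copied the data on every call.)
chunks = [chunk for chunk in reader]
df = pd.concat(chunks)

df.to_csv("large.csv")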
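
If the goal is only the CSV and not an in-memory DataFrame, a variant that writes each chunk as soon as it is read should keep memory use down to roughly one chunk at a time. This is a minimal sketch rather than a tested recipe for a 3.3 GB file: the function name stata_to_csv_in_chunks and its parameters are made up for illustration, and the optional columns argument is simply passed through to read_stata to load a subset of variables.

```python
import pandas as pd


def stata_to_csv_in_chunks(dta_path, csv_path, chunksize=100_000, columns=None):
    """Copy a .dta file to CSV one chunk at a time and return the row count.

    Hypothetical helper: only `chunksize` and `columns` are real
    read_stata arguments; everything else is illustrative.
    """
    total_rows = 0
    # With chunksize set, read_stata returns an iterator that can be used
    # as a context manager, so the file is closed when the loop finishes.
    with pd.read_stata(dta_path, chunksize=chunksize, columns=columns) as reader:
        for i, chunk in enumerate(reader):
            # The first chunk creates the file and writes the header;
            # later chunks are appended without repeating the header.
            chunk.to_csv(
                csv_path,
                mode="w" if i == 0 else "a",
                header=(i == 0),
                index=False,
            )
            total_rows += len(chunk)
    return total_rows
```

For example, stata_to_csv_in_chunks("large.dta", "large.csv") would produce the same kind of CSV as above (minus the index column), and the result can later be read back selectively with pd.read_csv.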

answered Sep 23 '22 11:09

Jinhua Wang

Tags: python, pandas, stata