
Pandas read_stata() with large .dta files

I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively so. I am interested in using IPython and tried to import the .dta file using Pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file uses up all of the RAM (after ~30 minutes) and causes my computer to stall out. This doesn't 'feel' right, since I can open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:

%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')

and I am using IPython in Enthought's Canopy program. The '%time' is there because I want to benchmark this against R's read.dta().

My questions are:

  1. Is there something I am doing wrong that is resulting in Pandas having issues?
  2. Is there a workaround to get the data into a Pandas dataframe?
Jonathan asked Nov 02 '13 17:11



1 Answer

A simple way to handle this is to pass chunksize to Pandas' built-in read_stata function, so the file is read in pieces rather than all at once.

Assume your large file is named large.dta.

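import pandas as pd

# With chunksize set, read_stata returns an iterator of DataFrames
reader = pd.read_stata("large.dta", chunksize=100000)

# Collect the chunks and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
chunks = [chunk for chunk in reader]
df = pd.concat(chunks)

# Write the combined data out as CSV
df.to_csv("large.csv")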
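The snippet above still holds the whole dataset in memory before writing the CSV. If the goal is only the conversion, a possible variation is to write each chunk to disk as it is read, so that only one chunk is in memory at a time. The following is a minimal sketch under that assumption, not a tested solution for the 3.3 GB file in the question; the function name stata_to_csv_in_chunks and its parameters are hypothetical, and using the reader as a context manager assumes a reasonably recent pandas version.

```python
import pandas as pd


def stata_to_csv_in_chunks(dta_path, csv_path, chunksize=100_000):
    """Stream a Stata file to CSV one chunk at a time; return rows written.

    Hypothetical helper for illustration only.
    """
    rows_written = 0
    # With chunksize set, read_stata returns an iterator of DataFrames
    # rather than a single DataFrame.
    with pd.read_stata(dta_path, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            chunk.to_csv(
                csv_path,
                mode="w" if i == 0 else "a",  # overwrite first, then append
                header=(i == 0),              # write the header only once
                index=False,
            )
            rows_written += len(chunk)
    return rows_written
```

It could be called as stata_to_csv_in_chunks("large.dta", "large.csv"). The chunk size trades memory for speed: smaller chunks use less RAM but mean more write calls.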
Jinhua Wang answered Sep 23 '22 11:09