I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively large. I want to use IPython and tried to import the .dta file with Pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file uses up all of the RAM (after ~30 minutes) and causes my computer to stall out. This doesn't 'feel' right, since I can open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:
%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')
and I am using IPython in Enthought's Canopy program. The '%time' is there because I am interested in benchmarking this against R's read.dta().
My questions are:
In Excel, you would choose File, then Open, and then under "Files of type" select comma-separated file (Excel expects those files to have a .csv extension). You can then click the file and open it in Excel. You can learn more about this by seeing the Stata help file for outsheet.
To import .dta files in R, use the read_dta() function from the haven package.
DATA files are commonly used to store data for offline data analysis when not connected to an Analysis Studio server, but may also be used in online mode. Because they are tab-delimited, DATA files can be imported with the pandas read_csv function once their header information is stripped.
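As a rough illustration of reading a tab-delimited file with pandas after skipping header lines, here is a minimal sketch. The file contents and the number of header lines (`HEADER_LINES = 2`) are made up for the example; a real DATA file may have a different header layout, so check the file before choosing how many lines to skip.

```python
import io

import pandas as pd

# Hypothetical contents: two header lines, then tab-delimited columns.
raw = "Analysis export\nversion 1\nid\tvalue\n1\t3.5\n2\t4.25\n"

HEADER_LINES = 2  # assumption: depends on the actual file

# sep="\t" selects tab-delimited parsing; skiprows drops the header lines,
# so the first remaining line is used for the column names.
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=HEADER_LINES)
print(df)
```

For a file on disk, pass the path in place of the `io.StringIO` object.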
There is a simpler way to solve it using Pandas' built-in function read_stata.
Assume your large file is named large.dta.
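import pandas as pd

# Read the file 100,000 rows at a time instead of loading it all at once.
with pd.read_stata("large.dta", chunksize=100000) as reader:
    for i, chunk in enumerate(reader):
        # Write each chunk straight to the CSV: the first chunk creates the
        # file with a header, later chunks are appended without one.
        # (DataFrame.append was removed in pandas 2.0 and copied the whole
        # frame on every iteration.)
        chunk.to_csv("large.csv", mode="w" if i == 0 else "a", header=(i == 0))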
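Before running a chunked conversion on the full 3.3 GB file, it may be worth checking the approach on a small file. The sketch below writes a tiny .dta file to a temporary directory, streams it to CSV in chunks of 3 rows, and reads the CSV back. The helper name `dta_to_csv`, the sample data, and the file names are made up for illustration; only `read_stata`, `to_stata`, `to_csv`, and `read_csv` are pandas functions.

```python
import os
import tempfile

import pandas as pd


def dta_to_csv(dta_path, csv_path, chunksize=100_000):
    """Stream a Stata file to CSV chunk by chunk; return the rows written."""
    rows = 0
    with pd.read_stata(dta_path, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            # The first chunk creates the file and writes the header;
            # later chunks are appended without a header.
            chunk.to_csv(csv_path, mode="w" if i == 0 else "a",
                         header=(i == 0), index=False)
            rows += len(chunk)
    return rows


# Build a small sample file (hypothetical data) to exercise the helper.
tmp_dir = tempfile.mkdtemp()
dta_path = os.path.join(tmp_dir, "small.dta")
csv_path = os.path.join(tmp_dir, "small.csv")

sample = pd.DataFrame({"id": range(10), "score": [x * 0.5 for x in range(10)]})
sample.to_stata(dta_path, write_index=False)

rows_written = dta_to_csv(dta_path, csv_path, chunksize=3)
result = pd.read_csv(csv_path)
print(rows_written, result.shape)
```

Only one chunk is held in memory at a time, so peak memory use should scale with `chunksize` rather than with the size of the file. This sketch passes `index=False`, so the CSV contains only the data columns.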