 

Why is an R object so much larger than the same data in Stata/SPSS?

Tags: memory, r, survey

I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs also occupies approximately the amount of space you would expect (~800 MB) in memory when I'm working with that data.

I've been trying to pick up R, and so attempted to load this data into R. No matter what method I try (read.dta on the Stata file, fread on a CSV file, read.spss on the SPSS file), the R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object to an R file, it is less than 100 MB on disk, but on loading it is the same size as before.

Any attempts to analyse the data using the survey package, particularly if I try to subset the data, take significantly longer than the equivalent command in Stata.

e.g. I have a household-size variable 'hhpers' in my data 'hh', weighted by the variable 'hhwt' and subset by 'htype'

R code:

require(survey)
sv.design <- svydesign(ids = ~0, data = hh, weights = hh$hhwt)
rm(hh)
system.time(svymean(~hhpers, sv.design[which(sv.design$variables$htype == "rural"), ]))

pushes the memory used by R up to 6 GB and takes a very long time - user 3.70, system 1.75, elapsed 144.11 (seconds)

The equivalent operation in stata

svy: mean hhpers if htype == 1

completes almost instantaneously, giving me the same result.

Why is there such a massive difference between R and Stata, in both memory usage (by the object as well as by the function) and in time taken? Is there anything I can do to optimise the data and how R works with it?

ETA: My machine is running 64-bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.

After some digging, I expect the reason for this is R's limited number of data types. All my data is stored as int, which takes 4 bytes per element. In survey data, each response is categorically coded and typically requires only one byte to store. Stata stores these using the 'byte' data type, while R stores them using the 'int' data type, leading to significant inefficiency in large surveys.
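A quick way to see the 4-bytes-per-element cost (a rough sketch; the exact object.size() figures vary slightly by platform):

x <- sample(1:5, 1e6, replace = TRUE)  # a million categorical responses coded 1-5
object.size(x)                         # ~4 MB: R integers take 4 bytes each
object.size(as.raw(x))                 # ~1 MB: raw vectors take 1 byte each,
                                       # roughly what Stata's 'byte' type uses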

asked Apr 23 '15 by bldysabba



1 Answer

Regarding the difference in memory usage - you're on the right track, and (mostly) it's because of object types. Storing everything as integers will indeed take up a lot of your memory, so setting variable types properly would improve R's memory usage. as.factor() can help; see ?as.factor for details on converting columns after reading the data. To fix this while reading the data from the file, refer to the colClasses parameter of read.table() (and of the similar functions specific to the Stata and SPSS formats). This will help R store the data more efficiently (its on-the-fly guessing of types is not top-notch).
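A minimal sketch of both approaches, assuming the data sit in a CSV called 'hh.csv' with the column names from the question (the file name is made up for illustration):

# Option 1: declare column types up front so R never has to guess
hh <- read.csv("hh.csv",
               colClasses = c(htype = "factor",
                              hhpers = "integer",
                              hhwt = "numeric"))

# Option 2: convert a column after reading
hh$htype <- as.factor(hh$htype)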

Regarding the second part - calculation speed - parsing of large datasets is not perfect in base R, and that's where the data.table package comes in handy: it's fast and quite similar to the original data.frame behaviour, and summary calculations are really quick. You would use it via hh <- as.data.table(read.table(...)), and you can calculate something similar to your example with

library(data.table)

hh <- as.data.table(hh)
## survey-weighted mean of household size in the rural subset
hh[htype == "rural", weighted.mean(hhpers, hhwt)]
## or the same grouped by household type - note the 'empty' first argument (all rows)
hh[, weighted.mean(hhpers, hhwt), by = htype]
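(A side note: since the question already uses fread(), be aware that data.table's fread() returns a data.table directly, so the as.data.table() step can be skipped, e.g. hh <- fread("hh.csv"), with the file name again assumed.)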

Sorry, I'm not familiar with survey data studies, so I can't be more specific.

Another detail on memory usage by the function - most likely R made a copy of your entire dataset to calculate the summaries you were looking for. Again, data.table would help in this case: it prevents R from making excessive copies and so improves memory usage.
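A sketch of the difference, using a made-up derived column hhpers2:

# base R: a replacement like this can copy the whole (multi-GB) data frame
hh$hhpers2 <- hh$hhpers + 1

# data.table: := adds the column by reference, without a full copy
hh[, hhpers2 := hhpers + 1]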

answered Nov 15 '22 by Sergii Zaskaleta