R biglm with categorical variables

Tags:

r

I have a large data set that I'm working with in R using some of the big.___() packages. It's ~10 GB (about 100 million rows x 15 columns) and looks like this:

Price         Var1         Var2
12.45          1             1
33.67          1             2
25.99          3             3
14.89          2             2
23.99          1             1
...            ...          ...

I am trying to predict price based on Var1 and Var2.

The problem I've run into is that Var1 and Var2 are categorical / factor variables.
Var1 and Var2 each have 3 levels (1, 2 and 3), but there are only 6 combinations in the data set:

(1,1;  1,2;  1,3;  2,2;  2,3;  3,3)

To use factor variables in biglm() they must be present in each chunk of data that biglm uses (my understanding is that biglm breaks the data set into 'x' number of chunks and updates the regression parameters after analyzing each chunk in order to get around dealing with data sets that are larger than RAM).

I've tried to subset the data but my computer can't handle it or my code is wrong:

bm11 <- big.matrix(150000000, 3)
bm11 <- subset(x, x[,2] == 1 & x[,3] == 1)

The above gives me a bunch of these:

Error: cannot allocate vector of size 1.1 Gb

Does anyone have any suggestions for working around this issue?

I'm using 64-bit R on a Windows 7 machine with 4 GB of RAM.


asked May 08 '12 16:05

screechOwl

1 Answer

You do not need all the data, or every value, to be present in each chunk; you just need all the levels accounted for. This means that you can have a chunk like this:

curchunk <- data.frame( Price=c(12.45, 33.67), Var1=factor( c(1,1), levels=1:3), 
  Var2 = factor( 1:2, levels=1:3 ) )

and it will work. Even though there is only 1 value in Var1 and 2 values in Var2, all three levels are present in both so it will do the correct thing.
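As a quick check of that claim, here is a minimal sketch in base R that uses the same two-row chunk. It only inspects the declared levels and the design-matrix columns implied by the formula Price ~ Var1 + Var2; that formula is an assumption based on the question, not something stated in it.

```r
# Two-row chunk: only one value of Var1 occurs, but all three levels are declared.
curchunk <- data.frame(
  Price = c(12.45, 33.67),
  Var1  = factor(c(1, 1), levels = 1:3),
  Var2  = factor(1:2, levels = 1:3)
)

levels(curchunk$Var1)   # the declared levels, not just the observed values
table(curchunk$Var1)    # counts per level, including levels that never occur

# The design-matrix columns depend on the declared levels, so every chunk
# built this way lines up with every other chunk.
design_cols <- colnames(model.matrix(Price ~ Var1 + Var2, data = curchunk))
design_cols
```

Because the columns come from the declared levels rather than from the values that happen to appear, each chunk contributes to the same set of coefficients.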

Also, biglm does not break the data into chunks for you; it expects you to give it manageable chunks to work with. Work through the examples in the package documentation to see this better. A common approach with biglm is to read from a file or database: read in the first 'n' rows (where 'n' is a reasonable chunk size) and pass them to biglm (after making sure every factor has all of its levels specified), then remove that chunk from memory, read in the next 'n' rows, and pass them to update. Continue like this until the end of the file, removing each used chunk so that there is enough memory for the next one.
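The read-fit-update loop described above might look roughly like the following. This is a sketch under stated assumptions, not a tuned implementation: the CSV file, its random contents, the chunk size of 250 rows, and the helper read_chunk() are all made up for illustration, and the real data would be read from its own file or database instead.

```r
library(biglm)

# Small illustrative file standing in for the real data (hypothetical values).
set.seed(1)
n <- 1000
demo <- data.frame(
  Price = round(runif(n, 10, 40), 2),
  Var1  = sample(1:3, n, replace = TRUE),
  Var2  = sample(1:3, n, replace = TRUE)
)
path <- tempfile(fileext = ".csv")
write.csv(demo, path, row.names = FALSE)

# Hypothetical helper: read up to n_rows lines from an open connection and
# declare the full level set on both factors. Returns NULL at end of input.
read_chunk <- function(con, n_rows, col_names) {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = n_rows, col.names = col_names),
    error = function(e) NULL   # read.csv errors once no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) return(NULL)
  chunk$Var1 <- factor(chunk$Var1, levels = 1:3)
  chunk$Var2 <- factor(chunk$Var2, levels = 1:3)
  chunk
}

con <- file(path, open = "r")
# Consume the header line once; later reads continue from the current position.
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

chunk_size <- 250
chunk <- read_chunk(con, chunk_size, col_names)
fit <- biglm(Price ~ Var1 + Var2, data = chunk)   # first chunk creates the model

# Each later chunk replaces the previous one in memory and updates the fit.
while (!is.null(chunk <- read_chunk(con, chunk_size, col_names))) {
  fit <- update(fit, chunk)
}
close(con)

coef(fit)
```

Only one chunk is held in memory at a time, so the chunk size can be chosen to fit the available RAM. On the small demo file the coefficients should agree with an ordinary lm() fit on the full data, which is a convenient sanity check before running the loop on the real file.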


answered Oct 11 '22 00:10

Greg Snow
