I am currently trying to run a logistic regression model. My data has two variables, one response variable and one predictor variable. The catch is that I have 200 million observations. I am trying to fit the model but am having extreme difficulty doing so in R/Stata/MATLAB, even with the help of EC2 instances on Amazon. I believe the problem lies in how the logistic regression functions are defined in the language itself. Is there another way to run a logistic regression quickly? Currently the problem is that my data quickly fills up whatever space it is using; I have even tried using up to 30 GB of RAM to no avail. Any solutions would be greatly welcome.
If your main issue is estimating a logit model under memory constraints, rather than the speed of the estimation, you can take advantage of the additivity of maximum likelihood estimation and write a custom program for ml. A logit model is simply maximum likelihood estimation using the logistic distribution, and the fact that you have only one independent variable simplifies the problem. I've simulated the problem below. You should create two do-files out of the following code blocks.
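The property being exploited is that the log likelihood is a sum over independent observations, so it can be accumulated piece by piece. Writing \Lambda for the logistic CDF (Stata's invlogit()), y_i and x_i for the dependent and independent variable of observation i, and splitting the N observations into K pieces, the quantity ml maximizes is

\ln L(b) = \sum_{i=1}^{N} \left[ y_i \ln \Lambda(b_1 x_i + b_0) + (1-y_i) \ln\bigl(1-\Lambda(b_1 x_i + b_0)\bigr) \right]
         = \sum_{k=1}^{K} \sum_{i \in \text{piece } k} \left[ y_i \ln \Lambda(b_1 x_i + b_0) + (1-y_i) \ln\bigl(1-\Lambda(b_1 x_i + b_0)\bigr) \right]

so each piece's contribution can be computed with only that piece in memory, and the pieces' contributions are then added up as scalars.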
If you have no issue loading the whole dataset (which you shouldn't: my simulation used only about 2 GB of RAM with 200 million observations and 2 variables, though mileage may vary), the first step is to break the dataset down into manageable pieces. For instance:
depvar = your dependent variable (0s and 1s)
indepvar = your independent variable (some numeric data type)
cd "/path/to/largelogit"
clear all
set more off
set obs 200000000
// We have two variables: an independent variable and a dependent variable.
gen indepvar = 10*runiform()
gen depvar = .
// As indepvar increases, the probability of depvar being 1 also increases.
replace depvar = 1 if indepvar > ( 5 + rnormal(0,2) )
replace depvar = 0 if depvar == .
save full, replace
clear all
// Need to split the dataset into manageable pieces
local max_opp = 20000000 // maximum observations per piece
local obs_num = `max_opp'
local i = 1
while `obs_num' == `max_opp' {
    clear
    local h = `i' - 1
    local obs_beg = (`h' * `max_opp') + 1
    local obs_end = (`i' * `max_opp')
    capture noisily use in `obs_beg'/`obs_end' using full
    if _rc == 198 {
        capture noisily use in `obs_beg'/l using full
    }
    if _rc == 198 {
        continue, break
    }
    save piece_`i', replace
    sum
    local obs_num = `r(N)'
    local i = `i' + 1
}
From here, to minimize your memory usage, close Stata and reopen it. When you create such large datasets, Stata keeps some memory allocated for overhead even after you clear the dataset. You can type memory after the save full and after the clear all to see what I mean.
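For example, a minimal check along those lines (the exact figures will vary by machine) would be:

save full, replace
memory        // note how much memory Stata reports as allocated here
clear all
memory        // some of that allocation remains even though the data have been cleared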
Next you must define your own custom ml program, which feeds in each of these pieces one at a time, calculates and sums the log likelihoods of the observations in each piece, and adds them all together. You need to use the d0 ml method rather than the lf method, because the optimization routine with lf requires all of the data used to be loaded into Stata.
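For contrast, a minimal lf-style evaluator for this logit (the name llogit_lf and the usage lines are just for illustration) would look something like the sketch below; it fills in an observation-level log likelihood variable, which is exactly why ml needs the whole estimation sample in memory when you use lf:

program define llogit_lf
    args lnf xb
    quietly replace `lnf' = ln(invlogit(`xb'))   if $ML_y1 == 1
    quietly replace `lnf' = ln(1-invlogit(`xb')) if $ML_y1 == 0
end

// with all the data loaded, this would be run as:
// ml model lf llogit_lf (depvar = indepvar)
// ml maximize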
clear all
set more off
cd "/path/to/largelogit"
// This local stores the names of all the pieces
local p : dir "/path/to/largelogit" files "piece*.dta"
local i = 1
foreach j of local p { // Loop through all the names to count the pieces
    global pieces = `i' // The evaluator below loops over $pieces, so this count matters
    local i = `i' + 1
}
// Define our custom MLE logit program, using the d0 ml method
program define llogit_d0
    args todo b lnf // d0 evaluator arguments: todo (what to compute), b (coefficient vector), lnf (scalar to hold the overall log likelihood)
    tempvar y xb llike tot_llike it_llike
    quietly {
        forvalues i = 1/$pieces {
            capture drop _merge
            capture drop depvar indepvar
            capture drop `y'
            capture drop `xb'
            capture drop `llike'
            capture scalar drop `it_llike'
            merge 1:1 _n using piece_`i' // load piece `i' into memory, matched on observation number
            generate int `y' = depvar
            generate double `xb' = (indepvar * `b'[1,1]) + `b'[1,2] // the linear combination of the coefficient, the independent variable, and the constant
            generate double `llike' = .
            replace `llike' = ln(invlogit(`xb'))   if `y' == 1 // the log of the probability when the dependent variable is 1
            replace `llike' = ln(1-invlogit(`xb')) if `y' == 0 // the log of the probability when the dependent variable is 0
            sum `llike'
            scalar `it_llike' = `r(sum)' // the sum of the logged probabilities for this piece
            if `i' == 1 scalar `tot_llike' = `it_llike' // total log likelihood starts with the first piece
            else scalar `tot_llike' = `tot_llike' + `it_llike' // and accumulates each subsequent piece's log likelihood
        }
        scalar `lnf' = `tot_llike' // the total log likelihood, which must be returned to ml
    }
end
//This should work
use piece_1, clear
ml model d0 llogit_d0 (beta : depvar = indepvar )
ml search
ml maximize
I ran the above two blocks of code; as discussed below, the estimates match those from a standard logit.
Pros and cons of this approach:
Pro: memory use stays low, since only one piece of the data needs to be in memory at a time.
Con: it is slower than running logit on data that fits in memory, and some of the reported model statistics (notably the Wald chi2) are off when the data are split into multiple pieces (see the note below).
To test that the coefficients truly are the same as a standard logit, set obs to something relatively small, say 100000, and set max_opp to something like 1000. Run my code and look at the output, then run logit depvar indepvar and look at its output: they are the same apart from what I mention in "Con" above. Setting obs to the same value as max_opp will correct the Wald chi2 statistic.
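As a sketch, that comparison amounts to something like the following (after re-running the two do-files with the smaller obs and max_opp settings):

// in the first do-file: set obs 100000  and  local max_opp = 1000
// after both do-files have run, fit the standard estimator on the full (now small) dataset:
use full, clear
logit depvar indepvar
// the coefficients should match the ml maximize output above; with max_opp equal to obs,
// the Wald chi2 matches as well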