What is an efficient way of running a logistic regression for large data sets (200 million by 2 variables)?

I am currently trying to run a logistic regression model. My data has two variables, one response variable and one predictor variable. The catch is that I have 200 million observations. I am having extreme difficulty running the model in R/Stata/MATLAB, even with the help of EC2 instances on Amazon. I believe the problem lies in how the logistic regression functions are defined in those languages. Is there another way to run a logistic regression quickly? Currently the problem is that my data quickly fills up whatever space it is using; I have even tried using up to 30 GB of RAM to no avail. Any solutions would be greatly welcome.

asked Oct 01 '22 by user1398057



1 Answer

If your main issue is being able to estimate a logit model at all under memory constraints, rather than the speed of the estimation, you can take advantage of the additivity of the log likelihood and write a custom evaluator for Stata's ml command. A logit model is simply maximum likelihood estimation using the logistic distribution, and the fact that you have only one independent variable keeps the problem simple. I've simulated the problem below; you should create two do-files out of the following code blocks.
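Concretely, the log likelihood of a logit model is a sum of independent per-observation terms, so splitting the data into pieces simply partitions that sum:

\ln L(\beta_0, \beta_1) = \sum_{i=1}^{N} \Big[ y_i \ln \Lambda(\beta_0 + \beta_1 x_i) + (1 - y_i) \ln\big(1 - \Lambda(\beta_0 + \beta_1 x_i)\big) \Big], \qquad \Lambda(z) = \frac{1}{1 + e^{-z}}

Each piece contributes a partial sum, and the optimizer only ever needs the total, which is what the d0 program below computes.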

If you have no trouble loading the whole dataset (and you shouldn't: my simulation used only ~2 GB of RAM for 200 million observations and 2 variables, though mileage may vary), the first step is to break the dataset down into manageable pieces. For instance:

depvar = your dependent variable (0 or 1)
indepvar = your independent variable (some numeric data type)

cd "/path/to/largelogit"

clear all
set more off

set obs 200000000

// We have two variables: an independent variable and a dependent variable.
gen indepvar = 10*runiform()
gen depvar = .

// As indepvar increases, the probability of depvar being 1 also increases.
replace depvar = 1 if indepvar > ( 5 + rnormal(0,2) )
replace depvar = 0 if depvar == .

save full, replace
clear all

// Need to split the dataset into manageable pieces

local max_opp = 20000000    // maximum observations per piece

local obs_num = `max_opp'

local i = 1
while `obs_num' == `max_opp' {

    clear

    local h = `i' - 1

    local obs_beg = (`h' * `max_opp') + 1
    local obs_end = (`i' * `max_opp')

    capture noisily use in `obs_beg'/`obs_end' using full

    if _rc == 198 {
        capture noisily use in `obs_beg'/l using full
    }
    if _rc == 198 { 
        continue, break
    }

    save piece_`i', replace

    sum
    local obs_num = `r(N)'

    local i = `i' + 1

}

From here, to minimize your memory usage, close Stata and reopen it. When you create datasets this large, Stata keeps some memory allocated for overhead even after you clear the dataset. You can type memory after the save full and again after the clear all to see what I mean.
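For example, in the first do-file:

save full, replace
memory      // note the memory Stata has allocated for the data plus overhead
clear all
memory      // part of that allocation remains until you restart Stata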

Next you must define a custom ml evaluator that loads each of these pieces one at a time, sums the log likelihoods of the observations within each piece, and adds the piece totals together. You need the d0 ml method rather than the lf method, because the optimization routine with lf requires all of the data to be loaded into Stata at once.
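For comparison, a conventional lf evaluator for this model would look roughly like the sketch below (llogit_lf is just an illustrative name); ml evaluates it over every observation currently in memory, which is exactly what we cannot afford here:

// Standard lf evaluator for a logit - requires the full dataset in memory
program define llogit_lf
    args lnfj xb
    quietly replace `lnfj' = ln(invlogit(`xb'))     if $ML_y1 == 1
    quietly replace `lnfj' = ln(1 - invlogit(`xb')) if $ML_y1 == 0
end

// With the full dataset loaded, it would be run as:
// ml model lf llogit_lf (depvar = indepvar)
// ml maximize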

clear all
set more off

cd "/path/to/largelogit"

// This local stores the names of all the pieces 
local p : dir "/path/to/largelogit" files "piece*.dta"

local i = 1
foreach j of local p {    // Loop through all the names to count the pieces

    global pieces = `i'    // This is important for the program
    local i = `i' + 1

}

// Define our custom MLE logit program, using the d0 ml method

program define llogit_d0

    args todo b lnf 

    tempvar y xb llike
    tempname tot_llike it_llike    // these hold scalars, so tempname rather than tempvar

quietly {

    forvalues i=1/$pieces {

        capture drop _merge
        capture drop depvar indepvar
        capture drop `y'
        capture drop `xb'
        capture drop `llike' 
        capture scalar drop `it_llike'

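        // Load the next piece: merge 1:1 _n pulls in the variables from piece_`i' by observation number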
        merge 1:1 _n using piece_`i'

        generate int `y' = depvar

        generate double `xb' = (indepvar * `b'[1,1]) + `b'[1,2]    // Linear prediction: the slope times the independent variable, plus the constant

        generate double `llike' = .

        replace `llike' = ln(invlogit( `xb')) if `y'==1    // the log of the probability should the dependent variable be 1
        replace `llike' = ln(1-invlogit(`xb')) if `y'==0   // the log of the probability should the dependent variable be 0

        sum `llike' 
        scalar `it_llike' = `r(sum)'    // The sum of the logged probabilities for this iteration

        if `i' == 1     scalar `tot_llike' = `it_llike'    // Total log likelihood for first iteration
        else            scalar `tot_llike' = `tot_llike' + `it_llike' // Total log likelihood is the sum of all the iterated log likelihoods `it_llike'

    }

    scalar `lnf' = `tot_llike'   // The total log likelihood which must be returned to ml

}

end

//This should work

use piece_1, clear

ml model d0 llogit_d0 (beta : depvar = indepvar )
ml search
ml maximize

I just ran the above two blocks of code and received the following output:

[Image: Large Logit Output]

Pros and cons of this approach:
Pros:

    - The smaller the `max_opp' size, the lower the memory usage. I never used more than ~1 GB with the simulation above.
    - You end up with unbiased estimates, the full log likelihood for the entire dataset, and the correct standard errors - basically everything important for making inferences.

Cons:

    - What you save in memory you sacrifice in CPU time. I ran this on my personal laptop (Stata SE, one core, i5 processor) and it took overnight.
    - The Wald chi2 statistic is wrong, but I believe you can calculate it from the quantities mentioned above.
    - You don't get a pseudo R2 as you would with logit.

To check that the coefficients really are the same as a standard logit, set obs to something relatively small, say 100000, and set max_opp to something like 1000. Run my code and look at the output, then run logit depvar indepvar and look at its output: they are the same apart from what I mention in "Cons" above. Setting obs equal to max_opp will also correct the Wald chi2 statistic.
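For example, after shrinking the simulation (this check assumes the small full.dta fits comfortably in memory):

// In the first do-file, before re-running both do-files:
//   set obs 100000
//   local max_opp = 1000
// Then compare the custom ml output with Stata's built-in logit:
use full, clear
logit depvar indepvar    // coefficients and standard errors should match the ml output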

answered Oct 25 '22 by Brian Albert Monroe