I am running a matching procedure in R using the MatchIt package. I do propensity score matching, that is: estimate treatment selection with a logit model and pick the nearest match.
The dataset is huge (4 million rows). Is there any way to speed it up?
To make it clear what I have done:
require(MatchIt)
m.out <- matchit(treatment ~ age + agesq + male + income + ..., data = data, method = "nearest")
I was similarly frustrated but found a solution for my case.
Essentially, I got a substantial run-time reduction by splitting the propensity score matching into three steps:
library(MatchIt)
library(dplyr)
# step 1: fit the propensity score model once, outside matchit()
data$myfit <- fitted(glm(treatment ~ age + agesq + male + income + ..., data = data, family = "binomial"))
# step 2: keep only the columns the matching step actually needs
trimmed_data <- select(data, unique_id, myfit, treatment)
# step 3: match on the precomputed scores, passed via the `distance` argument
m.out <- matchit(treatment ~ unique_id, data = trimmed_data, method = "nearest", distance = trimmed_data$myfit)
matched_unique_ids_etc <- match.data(m.out, data = trimmed_data)
matched_unique_ids <- select(matched_unique_ids_etc, unique_id)
# rejoin the matched ids to the full dataset to recover all covariates
matched_data <- matched_unique_ids %>% inner_join(data, by = "unique_id")
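To make the idea concrete, here is a self-contained sketch of the three steps on simulated data. The dataset, variable names, and sample size are all illustrative (not from the original post), and it assumes a MatchIt version (4.x) whose `distance` argument accepts a numeric vector of precomputed scores:

```r
library(MatchIt)
library(dplyr)

# simulated stand-in for the real 4M-row dataset
set.seed(1)
n <- 10000
data <- data.frame(
  unique_id = 1:n,
  age       = rnorm(n, 40, 10),
  income    = rnorm(n, 50, 15)
)
data$treatment <- rbinom(n, 1, plogis(-2 + 0.03 * data$age))

# step 1: precompute propensity scores with a plain logit
data$myfit <- fitted(glm(treatment ~ age + income, data = data, family = "binomial"))

# step 2: trim to the columns matching needs
trimmed_data <- select(data, unique_id, myfit, treatment)

# step 3: nearest-neighbor matching on the precomputed scores
m.out <- matchit(treatment ~ unique_id, data = trimmed_data,
                 method = "nearest", distance = trimmed_data$myfit)

# recover the full covariates for the matched sample
matched_data <- match.data(m.out, data = trimmed_data) %>%
  select(unique_id) %>%
  inner_join(data, by = "unique_id")
```

On a toy sample like this the savings are negligible; the gain on millions of rows comes from matchit() no longer fitting the model itself and no longer carrying every covariate through the matching step.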
The formula passed to matchit() does not affect the nearest-neighbor matching here, because the precomputed scores supplied via `distance` are used instead; it only needs to name the treatment variable.
The default distance/link for matchit() was glm/logit when I wrote this, so the code above reproduces that default.