Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to deal with issue merging two datasets?

Tags:

merge

r

I am working with two datasets in R: svolik and est. For context, I have developed a new measure of a concept (legislative power sharing), and I am using it to replicate a previous study: Svolik (2012). The goal of the exercise is to see if the results are different when using my measure.

Here is the svolik data: https://drive.google.com/file/d/1nCBhRXNcBrLEr6-R2pkyuQ9mCtJKkdmm/view?usp=sharing

Here is the est data: https://drive.google.com/file/d/1D-UmHSi9LIEsmY5VBvU8nxu8u1gix7Ay/view?usp=sharing

I started with the dataset that Svolik used to generate his results. I successfully reproduced his results (models 1, 3, and 5 in the figure). I then merged his dataset with the dataset containing my new measure, discarding any observations for which there was no exact match:

# load original data (the data used to produce original results)
svolik <- read_dta("svolik.dta")

# load data containing my new measure
est <- read.csv("Merging with Svolik.csv")

# merge
final <- merge(svolik, est, by = c("ccode", "year"), all = FALSE)

Next, I run his models again, but I replace his Legislature variable with my Legislative Power Sharing variable (models 2, 4, and 6 in the figure). Notice that, despite the data covering the same period of time, the original models and my own contain slightly different numbers of observations (2,903 as opposed to 2,934).

I cannot for the life of me figure out why I am getting these extra observations. My guess is that it has something to do with merging/duplicates or something like that. Does this seem like the likely problem to you? If so, do you know of a way to find out what those observations are? The solution is probably quite simple, and I am probably just overthinking things. Any advice would be appreciated! Note that I tried using a different merging strategy --- left_join in dplyr()--- but that didn't work.

enter image description here

Note that I am running the results in Stata. Here is the Stata code for the original results (i.e., models 1, 3, and 5):

* SURVIVAL ANALYSIS

use "leaders, institutions, covariates, updated tvc.dta" 

* NATURAL DEATHS
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural

stset t, id(leadid) failure(c_natural)

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace  ctitle(natural, leg) tex nonotes bdec(3) e(all) ef

* COUPS
gen c_coup= censoring
replace c_coup=0 if exit!="coup"
replace c_coup=. if exit==""

stset t, id(leadid) failure(c_coup)
* REMOVE SOM DUPLICATE OBSERVATIONS
* drop if (t[_n-1]==t &  leadid[_n-1]== leadid) 
stset t, id(leadid) failure(c_coup)

stcox legislature  lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef
 

* REVOLTS
gen c_revolt= censoring
replace c_revolt=0 if exit!="revolt"
replace c_revolt=. if exit==""
tab c_revolt

stset t, id(leadid) failure(c_revolt)
* COMMUNIST LEFT OUT BECAUSE IT IS A PERFECT PREDICTOR

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED  mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef

Here is the Stata code for the new results (i.e., models 2, 4, and 6):

* SURVIVAL ANALYSIS

use "merged_test.dta" 

* NATURAL DEATHS
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural

stset t, id(leadid) failure(c_natural)

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace  ctitle(natural, leg) tex nonotes bdec(3) e(all) ef

* COUPS
gen c_coup= censoring
replace c_coup=0 if exit!="coup"
replace c_coup=. if exit==""

stset t, id(leadid) failure(c_coup)
* REMOVE SOM DUPLICATE OBSERVATIONS
* drop if (t[_n-1]==t &  leadid[_n-1]== leadid) 
stset t, id(leadid) failure(c_coup)

stcox estimate  lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef
 

* REVOLTS
gen c_revolt= censoring
replace c_revolt=0 if exit!="revolt"
replace c_revolt=. if exit==""
tab c_revolt

stset t, id(leadid) failure(c_revolt)
* COMMUNIST LEFT OUT BECAUSE IT IS A PERFECT PREDICTOR

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED  mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef
like image 596
w5698 Avatar asked Oct 24 '25 01:10

w5698


1 Answers

Explanation of the problem

Your problem is not that svolik has 2903 observations, final has 2934 and therefore final is a superset of svolik caused by some duplicate rows in the merge. You will never have 2903 rows that you can use in both regressions. This is because not all complete observations used in the Svolik regression have a corresponding row in the est dataset. Firstly let's understand where the 2903 observations in svolik come from:

svolik_reg_cols <- c("legislative", "lgdp_1", "growth_1", "exportersoffuelsmainlyoil_EL2008", "ethfrac_FIXED", "communist", "mil", "cw", "age")
svolik_is_complete <- complete.cases(svolik[, svolik_reg_cols])
sum(svolik_is_complete) # 2903

As you can see, it is the number of complete cases for all the columns in the regression. Now let's do the same with final, using your join method:

final <- merge(svolik, est, by = c("ccode", "year"), all = FALSE)
final_reg_cols <- svolik_reg_cols
final_reg_cols[final_reg_cols == "legislative"] <- "estimate"
final_is_complete <- complete.cases(final[, final_reg_cols])
sum(final_is_complete) # 2934

Again, 2934 is the number of observations with no missing data for any of the covariates.

However, let's look at the datasets you are joining. There are 278 sets of ccode and year in svolik that do not appear in est.

# How many ccode and year are in svolik but not est
dplyr::anti_join(
    svolik,
    est,
    by = c("ccode", "year")
) |>
    group_by(ccode, cabb, year) |>
    summarise(n = n()) |>
    arrange(desc(n)) |>
    print(n = 2)

# # A tibble: 278 × 3
# # Groups:   ccode [39]
#   ccode  year     n
#   <dbl> <dbl> <int>
# 1   990  1982     4
# 2   947  2001     3
# # … with 276 more rows

This means that with the data you have, it is impossible to compare the results across all observations.

Solutions

You have three options:

  1. Get more data.
  2. Impute missing values.
  3. Restrict the regression to common observations.

You will know whether 1. or 2. are possible. However as the purpose of your analysis seems to be to compare your new metric to Svolik, 3. seems a reasonable approach, particularly as you do not end up dropping many rows.

First find the common rows (there are 2830) and save to dta:

all_complete <- complete.cases(final[, c("estimate", svolik_reg_cols)])
sum(all_complete) # 2830
final_complete <- final[all_complete, ]
write_dta(final_complete, "./tmp/svolik_est_merged.dta")

Stata code

You can now run the regression in Stata. Firstly load and prepare the data as previously:

use svolik_est_merged.dta, clear

* NATURAL DEATHS
cap drop c_natural c_coup c_revolt _d _t _t0
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural

stset t, id(leadid) failure(c_natural)

Now run the Svolik regression. You can see there are 2830 observations:

stcox legislative lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
Cox regression with Breslow method for ties

No. of subjects =   383                                 Number of obs =  2,830
No. of failures =    40
Time at risk    = 3,098
                                                        LR chi2(9)    =  28.46
Log likelihood = -157.48569                             Prob > chi2   = 0.0008

-------------------------------------------------------------------------------------
                 _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
--------------------+----------------------------------------------------------------
        legislative |   1.006541   .0088251     0.74   0.457     .9893923    1.023988
             lgdp_1 |   1.437144   .3138694     1.66   0.097     .9366983    2.204962
           growth_1 |   1.010814   .0283629     0.38   0.701      .956725    1.067962
exportersoffue~2008 |   2.487166   1.205382     1.88   0.060     .9620061    6.430308
      ethfrac_FIXED |   1.011694     .00645     1.82   0.068     .9991306    1.024415
          communist |     2.0526   1.610128     0.92   0.359     .4411573    9.550262
                mil |    1.06844   .3944057     0.18   0.858     .5182463    2.202744
                 cw |    4.15784   2.325053     2.55   0.011     1.389562    12.44106
                age |   1.057077   .0172812     3.40   0.001     1.023744    1.091496
-------------------------------------------------------------------------------------

Then run your regression:

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age

Output:

Cox regression with Breslow method for ties

No. of subjects =   383                                 Number of obs =  2,830
No. of failures =    40
Time at risk    = 3,098
                                                        LR chi2(9)    =  28.00
Log likelihood = -157.71273                             Prob > chi2   = 0.0010

-------------------------------------------------------------------------------------
                 _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
--------------------+----------------------------------------------------------------
           estimate |   .9742007   .1278445    -0.20   0.842     .7532603    1.259946
             lgdp_1 |   1.506868   .3265272     1.89   0.058     .9854309    2.304222
           growth_1 |   1.007996    .028074     0.29   0.775      .954447     1.06455
exportersoffue~2008 |   2.147553   1.257702     1.31   0.192     .6814636    6.767761
      ethfrac_FIXED |   1.011719   .0070275     1.68   0.093     .9980384    1.025587
          communist |   2.064115   1.619767     0.92   0.356     .4433766    9.609369
                mil |   1.018648   .3747256     0.05   0.960     .4953321    2.094845
                 cw |   3.961413   2.202203     2.48   0.013     1.332464    11.77727
                age |   1.054575   .0174756     3.21   0.001     1.020873    1.089389
-------------------------------------------------------------------------------------

Again 2830 observations. The results seem pretty similar to me: the same two covariates (cw and age) have small p-values and all coefficients are close to Svolik. If you're trying to develop a metric which tells you something new, perhaps not what you want to hear. However, if you are trying to find out whether your metric is robust by comparing to an established one, perhaps that's better news.

One caveat of course is to think about whether the rows that you cannot join are missing completely at random. If there is a systemic issue causing data to be missing related to your dependent variable (e.g. authoritarian countries are more likely to be prevent data being published for certain years), it may not be reasonable to infer that similar results here can be extrapolated to missing data. One way to look at this would be to compare your Svolik results with the 2830 rows to the results with the 2903 rows and see if they are meaningfully different.

Edit: how to see which rows are not included in both

To answer your question in the comments. You wanted to see which c_natural failure was included in svolik but not final:

# Create dataset of complete svolik observations
svolik_complete <- svolik[svolik_is_complete, ]

# Subset data to failure of interest
svolik_failures <- svolik_complete |>
    filter(c_natural == 1)
final_failures <- final_complete |>
    filter(c_natural == 1)

# Find failures in svolik_complete but not in final_complete
anti_join(svolik_failures, final_failures, by = c("ccode", "year")) |>
    select(ccode, cabb, year)
# # A tibble: 1 × 3
#   ccode cabb   year
#   <dbl> <chr> <dbl>
# 1   640 TUR    1972 

You can do the same for c_coup or c_revolt.

like image 126
SamR Avatar answered Oct 26 '25 14:10

SamR



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!