 

How to adapt the 'Oaxaca' package regression model to make the results independent from indicator variables' reference categories?

Tags:

r

oaxaca

I am writing a paper on the gender pay gap in Lithuania, and my goal is to interpret statistical survey data, determining the factors partially explaining the wage gap (such as age, tenure, education, etc.), using the Oaxaca-Blinder decomposition. I have very little knowledge of 'R', although in University I did have some classes, mostly about linear regression models. Please excuse if my questions are not well-formulated. Any comments and advice will be greatly appreciated.

I came across the 'Oaxaca' package for 'R', but have not been able to fully adapt the 'formula' function to my data. The instructions of the package: https://cran.r-project.org/web/packages/oaxaca/oaxaca.pdf

My problem is not understanding how to properly use the 'formula' function for my data, which contains a lot of non-numeric variables that I tried to turn into indicator ("dummy") variables with values of "0" or "1".

Specifically, I cannot adjust the formula to make the result invariant to the selected reference category. When I try to do this, I get the error message: "Variables d1 + d2 + d3 + ... in argument 'formula' must indicate membership in mutually exclusive categories."

The 'Oaxaca' formula that more or less works for me looks like this:

1) y ~ x1 + x2 + x3 + ... | z

Here y is the dependent variable, x1 + x2 + x3 + ... are explanatory variables and z is an indicator variable that states whether an observation belongs to Group B (female) or group A (male).

The formula adjusted for reference category:

2) y ~ x1 + x2 + x3 + ... | z | d1 + d2 + d3 + ...

Here, d1 + d2 + d3 + ... are indicator ("dummy") variables that will be adjusted so that the decomposition results do not change depending on the user’s choice of the reference category (Gardeazabal and Ugidos, 2004).

I cannot run formula 2), but I can run formula 1) when I delete a couple of dummy variables, otherwise I get an error.

I have 5 levels (separate variables) for Age (1st - 14 to 19, 2nd - 20 to 29, 3rd - 30 to 39, etc.), 4 levels for tenure (1st - 0 to 2, 2nd - 2 to 4 years), 15 levels for Industry, 63 levels for Occupation, etc. I am going to call Age, Tenure, Industry and Occupation my different 'types' that should each have their own reference category omitted from the formula.

Since I use a lot of 'types' of indicator variables, what I don't understand is how 'R' recognizes which reference category belongs to which 'type'. Maybe 'R' reads all "dummy" variables as levels of the same 'type', and selects only 1 omitted variable as the reference category for all the variables?

Is there any way that you know in which I could adapt my data to specify the correct reference category for each 'type'? Judging by the example with 'Chicago' dataframe it seems like I have too many different 'types' of variables for this formula to work.
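For context on the question above: in base R a reference category is tied to a factor, not to loose 0/1 columns. With hand-made dummies R has no notion of 'type'; the reference for each set is simply whichever dummy was left out of that set. The following is a minimal sketch, assuming toy data that only borrows the question's column naming (the values are made up), of how each set of dummies can be collapsed back into one factor with its own reference level:

```r
# Toy data following the question's column naming (values are made up).
toy <- data.frame(
  age1 = c(1, 0, 0, 1), age2 = c(0, 1, 0, 0), age3 = c(0, 0, 1, 0),
  ten1 = c(1, 1, 0, 0), ten2 = c(0, 0, 1, 1)
)

# Collapse each set of 0/1 columns back into one factor.
# max.col() returns, per row, the index of the column holding the largest value.
# Assumption: every row has exactly one 1 within each set.
age_cols <- c("age1", "age2", "age3")
ten_cols <- c("ten1", "ten2")
toy$age <- factor(age_cols[max.col(as.matrix(toy[age_cols]), ties.method = "first")])
toy$ten <- factor(ten_cols[max.col(as.matrix(toy[ten_cols]), ties.method = "first")])

# Choose the reference category separately for each factor.
toy$age <- relevel(toy$age, ref = "age3")
toy$ten <- relevel(toy$ten, ref = "ten1")

# The design matrix drops exactly one level (the reference) per factor.
X <- model.matrix(~ age + ten, data = toy)
colnames(X)
```

This only illustrates how base R treats factors in an ordinary regression design matrix. Whether the 'Oaxaca' package accepts factors in its formula is not something the question establishes, so treat that part as untested.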

The original data I have is from the Lithuanian Structure of Earnings Survey 2014. I have created new data in Excel (later converted to a .csv file) using the original, following the example of the 'Chicago' dataframe, used in the 'Oaxaca' package example. The data created is mostly made of dummy variables with the values of "0" or "1", except for the Hours column, which contains hours worked in a month, and the log.wage column, containing the natural logarithm of the hourly wage. Everything else is indicator variables. However, these indicator variables belong to different types, as mentioned already, such as Age, Tenure, etc.

I have been unsuccessful in trying to manipulate the original dataset to create indicator variables using 'R', because I need to create specific new variables from a variety of the existing ones, for example all the occupations coded 431 and 432 should be merged into 1 variable titled 'prof43'. I have not found out how to do this so far.
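The merge described above can be sketched in base R. This is a minimal example under assumptions: the data frame `raw` and its column `occupation` (holding 3-digit occupation codes) are hypothetical names, not taken from the survey file.

```r
# Hypothetical raw data: 'occupation' holds 3-digit occupation codes.
raw <- data.frame(occupation = c(431, 432, 511, 962, 431))

# One indicator for the merged group: 1 if the code is 431 or 432, else 0.
raw$prof43 <- as.integer(raw$occupation %in% c(431, 432))

# Alternative: derive the 2-digit group from the 3-digit code (integer division)
# and keep it as a single factor instead of many separate columns.
raw$prof2 <- factor(raw$occupation %/% 10)

# Indicator columns for every 2-digit group, built in one step
# (no intercept, so every level gets its own column).
prof_dummies <- model.matrix(~ prof2 - 1, data = raw)
```

The `%in%` form is convenient when only a few codes are merged by hand. The integer-division form assumes that the first two digits of every code define the group, which may not hold for every classification.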

My data contains mostly indicator variables and the variable types look like this:


str(S14)

'data.frame':   44952 obs. of  71 variables:
 $ hours    : int  1 1 1 1 2 1 1 2 1 1 ...
 $ female   : int  0 1 1 1 0 0 0 1 0 0 ...
 $ age0     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age1     : int  1 1 0 0 0 0 0 1 1 0 ...
 $ age2     : int  0 0 0 1 0 1 0 0 0 0 ...
 $ age3     : int  0 0 1 0 1 0 0 0 0 1 ...
 $ age4     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age5     : int  0 0 0 0 0 0 1 0 0 0 ...
 $ prof11   : int  0 0 0 0 0 0 0 0 0 0 ...
......
 $ prof96   : int  0 0 0 0 1 0 0 0 0 0 ...
 $ edu1     : int  0 0 0 0 0 0 0 0 1 0 ...
 $ edu2     : int  0 1 0 0 1 1 0 1 0 1 ...
 $ edu3     : int  1 0 1 1 0 0 1 0 0 0 ...
 $ ten1     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ ten2     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ten3     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ten4     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ size1to50: int  1 1 0 1 1 1 0 1 1 1 ...
 $ nace1    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nace2    : int  0 0 0 0 0 0 0 0 0 0 ...
......
 $ nace15   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pubcon   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ temp     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ log.wage : num  1.79 1.79 1.79 1.79 1.79 ...

I run the 'Oaxaca' function using this code:

library(oaxaca)
set.seed(03104)        #random seed

I get results from this, yet I doubt their validity due to the fact that I delete 1 non-zero indicator variable (prof 62) (otherwise it doesn't run):

results0 <- oaxaca(log.wage ~ hours + pubcon + temp + size1to50 + age0 + age1 + age2 + 
age4 + age5 + ten1 + ten2 + ten4 + edu1 + edu3 + prof11 + prof12 + ..... + 
prof96 + nace1 + nace2 + ... + nace14 | female, data = S14, R = 30)           
# 1) y ~ x1 + x2 + x3 + ... | z

The code that gets the error message for me:

results1 <- oaxaca(log.wage ~ hours + pubcon + temp + size1to50 + 
age0 + age1 + age2 + age4 + age5 + ten1 + ten2 + ten4 + edu1 + edu3 + 
prof11 + prof12 + ..... + prof96 + nace1 + nace2 + ... + nace14 | female | 
pubcon + temp + size1to50 + age0 + age1 + age2 + age4 + age5 + ten1 + ten2 + 
ten4 + edu1 + edu3 + prof11 + prof12 + ..... + prof96 + nace1 + nace2 + ... + nace14, 
data = S14, R = 30)        # 2) y ~ x1 + x2 + x3 + ... | z | d1 + d2 + d3 + ...

Running this, I get the error message:

Variables d1 + d2 + d3 + ... in argument 'formula' must indicate membership in mutually exclusive categories.
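One way to see what "mutually exclusive" means here: within a single set of dummies (one categorical variable), a row can contain at most one 1. Once dummies from different sets, such as age and tenure, are listed together, the same row can contain several 1s. The sketch below uses made-up toy data and a hypothetical helper `is_exclusive`; it is an assumption about what the message refers to, not the package's actual internal check.

```r
# Toy data following the question's column naming (values are made up).
toy <- data.frame(
  age1 = c(1, 0, 0), age2 = c(0, 1, 0), age3 = c(0, 0, 1),
  ten1 = c(1, 1, 0), ten2 = c(0, 0, 1)
)

# Hypothetical helper: TRUE if no row has more than one 1 across the given columns.
is_exclusive <- function(data, cols) all(rowSums(data[cols]) <= 1)

# One set with its reference category (age3) left out: exclusive.
one_set <- is_exclusive(toy, c("age1", "age2"))

# Two sets mixed together: the first row has age1 = 1 and ten1 = 1, so not exclusive.
mixed_sets <- is_exclusive(toy, c("age1", "age2", "ten1"))
```

Running the same kind of check on the columns listed after the second `|` shows which rows carry more than one 1.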

Does anyone have any suggestions? Do you think using the original dataset and sorting it into indicator variables using 'R' would work, and I could select the reference category which the function 'formula' would recognize?

If so, what package and formulas do you suggest using to adapt my data?

Or do you think I am using too many variables for this 'Oaxaca' package and I should restrict my data?

Also, do the results I get with formula 1) make sense? I am worried that 'R' does not choose the correct reference category for each 'type' of variable set, resulting in all the indicator variables being dependent on some random omitted variable, which would make the results nonsensical.

Excuse my lengthy ramblings. I hope I made some sense, and if anyone has any experience of working with the 'Oaxaca' package or any ideas on what to do here and wants to voice them, I am extremely grateful in advance!

Paulina asked Sep 05 '25 02:09


1 Answer

I wrote to the creator of the package and sent him a link to this page, since I was having the same problem. Here is what he responded: "It looks like you are trying to include several (independent) sets of dummy variables, hence the error message. The oaxaca package, unfortunately, does not support this."

For what it's worth, it appears that oaxaca in Stata does support this, if you are looking for an alternative.

Ella Wind answered Sep 07 '25 22:09


