It is surprisingly challenging to find out how to estimate a bivariate probit model in R!

I recently faced the challenge of having to model two binary responses. My immediate reaction is to use two separate logit or probit models to explain each of the binary responses (i.e. a 1 rather than 0). The problem is that the two responses are correlated: the answer to the first one will likely impact the answer to the second question. One way to model simultaneously the two responses is to use the bivariate probit model.

It was a jungle to find out a package to estimate a bivariate probit model. This blog post intends to summarize this jungle, present the options in terms of packages and which solution I found. Long story short: the most viable option to implement a bivariate probit model is the GJRM package. A good explanation of the trouble of finding an appropriate package is summarized here.

When I started looking into a R package to estimate this model, I found multiple options, seemingly identical. I found three package alternatives: the VGAM package, the ZeligChoice package, and the GJRM package.

The first option I found was the VGAM package. The function to run a bivariate probit model is vglm, wherein we specify the family to be binom2.rho. I run below the example provided by the VGAM package to see how to results look like. I use the same example as shown in the package’s documentation: the coalminers dataset (with 9 observations).

> library(VGAM)
Loading required package: stats4
Loading required package: splines
Warning message:
package ‘VGAM’ was built under R version 4.1.3 
> coalminers <- transform(coalminers, Age = (age - 42) / 5)

> fit <- vglm(cbind(nBnW, nBW, BnW, BW) ~ Age,
+             binom2.rho, data = coalminers, trace = TRUE)
VGLM    linear loop  1 :  loglikelihood = -95.59942
VGLM    linear loop  2 :  loglikelihood = -95.59848
VGLM    linear loop  3 :  loglikelihood = -95.59848
> summary(fit)

Call:
vglm(formula = cbind(nBnW, nBW, BnW, BW) ~ Age, family = binom2.rho, 
    data = coalminers, trace = TRUE)

Coefficients: 
               Estimate Std. Error z value Pr(>|z|)    
(Intercept):1 -1.279160   0.014435  -88.61   <2e-16 ***
(Intercept):2 -0.881462   0.011242  -78.41   <2e-16 ***
(Intercept):3  2.044267   0.043254   47.26   <2e-16 ***
Age:1          0.273350   0.006178   44.25   <2e-16 ***
Age:2          0.184643   0.004901   37.68   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Names of linear predictors: probitlink(mu1), probitlink(mu2), rhobitlink(rho)

Log-likelihood: -95.5985 on 22 degrees of freedom

Number of Fisher scoring iterations: 3 

No Hauck-Donner effect found in any of the estimates

The code above works fine and explains 4 variables using Age. The problem is that, if I want to have different explanatory variables explaining different dependent variables, the code does not work. For example, if I want to explain two of the responses with age squared rather than age, I would write the following:

> fit <- vglm(list(cbind(nBnW, nBW) ~ Age, cbind(BnW, BW) ~ Age_sqr),
+             family = binom2.rho, data = coalminers, trace = TRUE)
Error in formula.default(object, env = baseenv()) : invalid formula

So the code simply does not work. In the package’s defense, the documentation never claims to allow for different explanatory variables for distinct responses. There is also no documentation as to how explanatory variables may differ for distinct dependent variables.

This problem carries over to the second package option I found. The Zeligverse or the Zelig package can also be used to estimate bivariate probit models. Beware that it requires an additional package to estimate logit and probit models: ZeligChoice (as explained here). Even though the documentation says it is possible, their function only works if the explanatory variables are the same for both binary dependent variables. As explained in Github, “support for multiple right-hand side equations was removed in Zelig 5”.

For illustration, let me show the output when running the example code that uses the sanction dataset. This is the example taken from the supporting documentation for the Zelig package to estimate a Model with different sets of explanatory variables.

> data(sanction)
> fml2 <- list(mu1 = import ~ coop, mu2 = export ~ cost + target)
> z.out2 <- zelig(fml2, model = "bprobit", data = sanction)
Error in formula.default(object, env = baseenv()) : invalid formula
> summary(z.out2)
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'object' in selecting a method for function 'summary': object 'z.out2' not found

Similarly to the VGAM package, it does not seem possible to run a bivariate probit model with different sets of explanatory variables.

Fortunately, someone else had the same problem and summarized the whole “jungle” of finding a package that estimates a bivariate probit model. Which lead me to find the third option: the GJRM package.

In three case of the GJRM package, the function to use is gjrm. The syntax to allow for different explanatory variables is somewhat different but fairly intuitive. In the formula field, one specifies a list object with two observations: list(y1 ~ x1 + x2 + x3, + y2 ~ x1 + x2 + x3). y1 and y2 are the two dependent variables and x1 to x3 are the explanatory variables. Again, let me use the example provided by the official documentation of the GJRM package which first simulates a dataset and then estimates a bivariate probit model with the simulated data:

> library(GJRM)
> set.seed(0)
> n <- 400
> Sigma <- matrix(0.5, 2, 2); diag(Sigma) <- 1
> u <- rMVN(n, rep(0,2), Sigma)
> x1 <- round(runif(n)); x2 <- runif(n); x3 <- runif(n)
> f1 <- function(x) cos(pi*2*x) + sin(pi*x)
> f2 <- function(x) x+exp(-30*(x-0.5)^2)
> y1 <- ifelse(-1.55 + 2*x1 + f1(x2) + u[,1] > 0, 1, 0)
> y2 <- ifelse(-0.25 - 1.25*x1 + f2(x2) + u[,2] > 0, 1, 0)
> dataSim <- data.frame(y1, y2, x1, x2, x3)

> ## CLASSIC BIVARIATE PROBIT
> out <- gjrm(list(y1 ~ x1 + x2 + x3,
+                  y2 ~ x1 + x2 + x3),
+             data = dataSim,
+             margins = c("probit", "probit"),
+             Model = "B")
> conv.check(out)

Largest absolute gradient value: 1.363631e-13
Observed information matrix is positive definite
Eigenvalue range: [9.528687,419.9106]

Trust region iterations: 3
Estimated overall probability range: 0.08777689 0.9171866
Estimated overall density range: 0.1915345 0.9836851

> summary(out)

COPULA:   Gaussian
MARGIN 1: Bernoulli
MARGIN 2: Bernoulli

EQUATION 1
Link function for mu.1: probit 
Formula: y1 ~ x1 + x2 + x3

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.94936    0.20532  -4.624 3.77e-06 ***
x1           1.97383    0.15150  13.028  < 2e-16 ***
x2           0.20574    0.25474   0.808    0.419    
x3          -0.06393    0.25939  -0.246    0.805    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


EQUATION 2
Link function for mu.2: probit 
Formula: y2 ~ x1 + x2 + x3

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.04474    0.17736   0.252  0.80084    
x1          -1.07918    0.13393  -8.058 7.76e-16 ***
x2           0.66869    0.23016   2.905  0.00367 ** 
x3           0.08599    0.23228   0.370  0.71124    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


n = 400  theta = 0.282(0.0492,0.45)  tau = 0.182(0.0313,0.297)
total edf = 9

The example code does not specify different explanatory variables for different dependent variables. I will run the code again specifying that y1 depends on x1 and x3, while y2 depends on x2 and x3:

>  out <- gjrm(list(y1 ~ x1  + x3,
+                   y2 ~  x2 + x3),
+                 data = dataSim,
+                 margins = c("probit", "probit"),
+                 Model = "B")
> conv.check(out)

Largest absolute gradient value: 1.006716e-10
Observed information matrix is positive definite
Eigenvalue range: [11.09104,403.8807]

Trust region iterations: 4
Estimated overall probability range: 0.1329729 0.9174455
Estimated overall density range: 0.1932023 1.132737

> summary(out)

COPULA:   Gaussian
MARGIN 1: Bernoulli
MARGIN 2: Bernoulli

EQUATION 1
Link function for mu.1: probit 
Formula: y1 ~ x1 + x3

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.93902    0.16427  -5.716 1.09e-08 ***
x1           2.14406    0.15800  13.570  < 2e-16 ***
x3          -0.05381    0.25843  -0.208    0.835    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


EQUATION 2
Link function for mu.2: probit 
Formula: y2 ~ x2 + x3

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.44800    0.15729  -2.848  0.00440 **
x2           0.63737    0.21517   2.962  0.00306 **
x3           0.08055    0.22077   0.365  0.71522   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


n = 400  theta = 0.294(0.0802,0.473)  tau = 0.19(0.0511,0.314)
total edf = 7

The code works fine allowing different explanatory variables to explain the multiple dependent variables. I will be using this package in the future to estimate my own bivariate probit model.

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s