I recently faced the challenge of having to model two binary responses. My immediate reaction is to use two separate logit or probit models to explain each of the binary responses (i.e. a 1 rather than 0). The problem is that the two responses are correlated: the answer to the first one will likely impact the answer to the second question. One way to model simultaneously the two responses is to use the bivariate probit model.

It was a jungle to find out a package to estimate a bivariate probit model. This blog post intends to summarize this jungle, present the options in terms of packages and which solution I found. Long story short: the most viable option to implement a bivariate probit model is the GJRM package. A good explanation of the trouble of finding an appropriate package is summarized here.

When I started looking into a R package to estimate this model, I found multiple options, seemingly identical. I found three package alternatives: the VGAM package, the ZeligChoice package, and the GJRM package.

The first option I found was the VGAM package. The function to run a bivariate probit model is *vglm*, wherein we specify the family to be *binom2.rho*. I run below the example provided by the VGAM package to see how to results look like. I use the same example as shown in the package’s documentation: the coalminers dataset (with 9 observations).

```
> library(VGAM)
Loading required package: stats4
Loading required package: splines
Warning message:
package ‘VGAM’ was built under R version 4.1.3
> coalminers <- transform(coalminers, Age = (age - 42) / 5)
> fit <- vglm(cbind(nBnW, nBW, BnW, BW) ~ Age,
+ binom2.rho, data = coalminers, trace = TRUE)
VGLM linear loop 1 : loglikelihood = -95.59942
VGLM linear loop 2 : loglikelihood = -95.59848
VGLM linear loop 3 : loglikelihood = -95.59848
> summary(fit)
Call:
vglm(formula = cbind(nBnW, nBW, BnW, BW) ~ Age, family = binom2.rho,
data = coalminers, trace = TRUE)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -1.279160 0.014435 -88.61 <2e-16 ***
(Intercept):2 -0.881462 0.011242 -78.41 <2e-16 ***
(Intercept):3 2.044267 0.043254 47.26 <2e-16 ***
Age:1 0.273350 0.006178 44.25 <2e-16 ***
Age:2 0.184643 0.004901 37.68 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Names of linear predictors: probitlink(mu1), probitlink(mu2), rhobitlink(rho)
Log-likelihood: -95.5985 on 22 degrees of freedom
Number of Fisher scoring iterations: 3
No Hauck-Donner effect found in any of the estimates
```

The code above works fine and explains 4 variables using Age. The problem is that, if I want to have different explanatory variables explaining different dependent variables, the code does not work. For example, if I want to explain two of the responses with age squared rather than age, I would write the following:

```
> fit <- vglm(list(cbind(nBnW, nBW) ~ Age, cbind(BnW, BW) ~ Age_sqr),
+ family = binom2.rho, data = coalminers, trace = TRUE)
Error in formula.default(object, env = baseenv()) : invalid formula
```

So the code simply does not work. In the package’s defense, the documentation never claims to allow for different explanatory variables for distinct responses. There is also no documentation as to how explanatory variables may differ for distinct dependent variables.

This problem carries over to the second package option I found. The Zeligverse or the Zelig package can also be used to estimate bivariate probit models. Beware that it requires an additional package to estimate logit and probit models: ZeligChoice (as explained here). Even though the documentation says it is possible, their function only works if the explanatory variables are the same for both binary dependent variables. As explained in Github, “support for multiple right-hand side equations was removed in Zelig 5”.

For illustration, let me show the output when running the example code that uses the sanction dataset. This is the example taken from the supporting documentation for the Zelig package to estimate a Model with different sets of explanatory variables.

```
> data(sanction)
> fml2 <- list(mu1 = import ~ coop, mu2 = export ~ cost + target)
> z.out2 <- zelig(fml2, model = "bprobit", data = sanction)
Error in formula.default(object, env = baseenv()) : invalid formula
> summary(z.out2)
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': object 'z.out2' not found
```

Similarly to the VGAM package, it does not seem possible to run a bivariate probit model with different sets of explanatory variables.

Fortunately, someone else had the same problem and summarized the whole “jungle” of finding a package that estimates a bivariate probit model. Which lead me to find the third option: the GJRM package.

In three case of the GJRM package, the function to use is *gjrm*. The syntax to allow for different explanatory variables is somewhat different but fairly intuitive. In the *formula* field, one specifies a *list* object with two observations: list(y1 ~ x1 + x2 + x3, + y2 ~ x1 + x2 + x3). *y1* and *y2 *are the two dependent variables and *x1* to *x3* are the explanatory variables. Again, let me use the example provided by the official documentation of the GJRM package which first simulates a dataset and then estimates a bivariate probit model with the simulated data:

```
> library(GJRM)
> set.seed(0)
> n <- 400
> Sigma <- matrix(0.5, 2, 2); diag(Sigma) <- 1
> u <- rMVN(n, rep(0,2), Sigma)
> x1 <- round(runif(n)); x2 <- runif(n); x3 <- runif(n)
> f1 <- function(x) cos(pi*2*x) + sin(pi*x)
> f2 <- function(x) x+exp(-30*(x-0.5)^2)
> y1 <- ifelse(-1.55 + 2*x1 + f1(x2) + u[,1] > 0, 1, 0)
> y2 <- ifelse(-0.25 - 1.25*x1 + f2(x2) + u[,2] > 0, 1, 0)
> dataSim <- data.frame(y1, y2, x1, x2, x3)
> ## CLASSIC BIVARIATE PROBIT
> out <- gjrm(list(y1 ~ x1 + x2 + x3,
+ y2 ~ x1 + x2 + x3),
+ data = dataSim,
+ margins = c("probit", "probit"),
+ Model = "B")
> conv.check(out)
Largest absolute gradient value: 1.363631e-13
Observed information matrix is positive definite
Eigenvalue range: [9.528687,419.9106]
Trust region iterations: 3
Estimated overall probability range: 0.08777689 0.9171866
Estimated overall density range: 0.1915345 0.9836851
> summary(out)
COPULA: Gaussian
MARGIN 1: Bernoulli
MARGIN 2: Bernoulli
EQUATION 1
Link function for mu.1: probit
Formula: y1 ~ x1 + x2 + x3
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.94936 0.20532 -4.624 3.77e-06 ***
x1 1.97383 0.15150 13.028 < 2e-16 ***
x2 0.20574 0.25474 0.808 0.419
x3 -0.06393 0.25939 -0.246 0.805
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
EQUATION 2
Link function for mu.2: probit
Formula: y2 ~ x1 + x2 + x3
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.04474 0.17736 0.252 0.80084
x1 -1.07918 0.13393 -8.058 7.76e-16 ***
x2 0.66869 0.23016 2.905 0.00367 **
x3 0.08599 0.23228 0.370 0.71124
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
n = 400 theta = 0.282(0.0492,0.45) tau = 0.182(0.0313,0.297)
total edf = 9
```

The example code does not specify different explanatory variables for different dependent variables. I will run the code again specifying that y1 depends on x1 and x3, while y2 depends on x2 and x3:

```
> out <- gjrm(list(y1 ~ x1 + x3,
+ y2 ~ x2 + x3),
+ data = dataSim,
+ margins = c("probit", "probit"),
+ Model = "B")
> conv.check(out)
Largest absolute gradient value: 1.006716e-10
Observed information matrix is positive definite
Eigenvalue range: [11.09104,403.8807]
Trust region iterations: 4
Estimated overall probability range: 0.1329729 0.9174455
Estimated overall density range: 0.1932023 1.132737
> summary(out)
COPULA: Gaussian
MARGIN 1: Bernoulli
MARGIN 2: Bernoulli
EQUATION 1
Link function for mu.1: probit
Formula: y1 ~ x1 + x3
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.93902 0.16427 -5.716 1.09e-08 ***
x1 2.14406 0.15800 13.570 < 2e-16 ***
x3 -0.05381 0.25843 -0.208 0.835
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
EQUATION 2
Link function for mu.2: probit
Formula: y2 ~ x2 + x3
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.44800 0.15729 -2.848 0.00440 **
x2 0.63737 0.21517 2.962 0.00306 **
x3 0.08055 0.22077 0.365 0.71522
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
n = 400 theta = 0.294(0.0802,0.473) tau = 0.19(0.0511,0.314)
total edf = 7
```

The code works fine allowing different explanatory variables to explain the multiple dependent variables. I will be using this package in the future to estimate my own bivariate probit model.