In this tutorial, you will learn how to simulate data and estimate the
linear regression parameters in the presence of a right-censored
covariate, as discussed in the paper “**Establishing the Parallels and
Differences Between Right-Censored and Missing Covariates**”. More
specifically, you will learn how to use the `data_mvn()` function to
generate data and how to estimate the parameters of interest using the
complete case (CC), inverse probability weighting (IPW), maximum
likelihood (MLE), augmented CC (ACC), modified ACC (MACC), and/or
augmented IPW (AIPW) estimators. The code for this tutorial is available
in its GitHub repository.

Throughout, we are interested in the linear regression model

\[ y_i = \beta_0 + \beta_1 (A_i-X_i) + \beta_2 Z_i + \epsilon_i \]

where \(y_i\) represents the outcome, \(A_i\) the current age, \(X_i\)
the age at diagnosis, \(A_i - X_i\) the time to clinical diagnosis,
\(Z_i\) the fully observed covariate, and
\(\epsilon_i \sim N(0,\sigma^2)\). The data also provide
\(W_i = \min(X_i,C_i)\), where \(C_i\) is the random right-censoring
variable and \(W_i\) is the observed event time. The indicator
\(D_i = I(X_i \leq C_i)\) equals \(1\) if the observed event time equals
the age at clinical diagnosis and \(0\) otherwise. The goal is to
estimate \(\theta = (\beta_0, \beta_1, \beta_2, \sigma)\) using the
observed data \(O = (Y,W,D,Z)\). Please refer to the **Supplementary
Material Section S.5** for the exact details of the data simulation
process, and to the main paper for more details on the estimators and
their robustness and efficiency properties.
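To make the definitions of \(W_i\) and \(D_i\) concrete, here is a minimal sketch in base R. The values of `X` and `C` are hypothetical toy numbers chosen only for illustration; they are not output of `data_mvn()`.

```r
# Toy values (hypothetical, for illustration only)
X <- c(1.2, -0.5, 0.3,  2.0)  # age at diagnosis (not always observed)
C <- c(0.8,  1.0, 0.3, -1.0)  # right-censoring variable

W <- pmin(X, C)               # observed event time: W_i = min(X_i, C_i)
D <- as.integer(X <= C)       # D_i = 1 if X_i is observed, 0 if right-censored

data.frame(X, C, W, D)
mean(D == 0)                  # empirical right-censoring rate
```

In practice only `W` and `D` are available to the analyst; `X` is known only for the rows where `D` equals 1.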

This vignette is divided into two main sections, each with two subsections. First, we will guide you through the process when the nuisance parameters are known, covering both independent and dependent covariate right-censoring. We will then repeat the process for when the nuisance parameters need to be estimated. Throughout, we assume that \(Y \perp C | X,Z\). The structure of this vignette is as follows:

- **Known nuisance parameters**
  - Independent covariate right-censoring (\(X \perp C | Z\))
  - Dependent covariate right-censoring (\(X \not\perp C | Z\))
- **Unknown nuisance parameters**
  - Independent covariate right-censoring (\(X \perp C | Z\))
  - Dependent covariate right-censoring (\(X \not\perp C | Z\))

The `data_mvn()` function generates data from a trivariate normal
distribution, under either independent or dependent covariate
right-censoring. The inputs of the function are as follows:

- `nSubjects = Integer`: the total number of observations.
- `dep_censoring = TRUE or FALSE`: if `FALSE`, the partial correlation of \((X,C)\) given \(Z\) equals zero. If `TRUE`, this partial correlation is not equal to zero.
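The following sketch illustrates how a trivariate normal sample with zero partial correlation between \(X\) and \(C\) given \(Z\) could be generated. It is not the implementation of `data_mvn()`. The correlations with \(Z\), the standard deviation of \(Z\), and all object names are assumptions made for illustration, while the means of 0 and standard deviations of 1 and 2 for \(X\) and \(C\) follow the values quoted from the Supplementary Material. It relies on the fact that, for jointly normal variables, the partial correlation of \((X,C)\) given \(Z\) is zero exactly when \(\rho_{XC} = \rho_{XZ}\,\rho_{CZ}\).

```r
library(MASS)  # provides mvrnorm()

set.seed(1)
n <- 1000

# Illustrative (assumed) correlations with Z; not taken from the paper
r_xz <- 0.55
r_cz <- 0.25
# Zero partial correlation of (X, C) given Z requires
# cor(X, C) = cor(X, Z) * cor(C, Z)
r_xc <- r_xz * r_cz

R <- matrix(c(1,    r_xc, r_xz,
              r_xc, 1,    r_cz,
              r_xz, r_cz, 1), nrow = 3, byrow = TRUE)
sds <- c(1, 2, 1)                    # SDs of X, C, Z (SD of Z is an assumption)
Sigma <- diag(sds) %*% R %*% diag(sds)

sim <- as.data.frame(mvrnorm(n, mu = c(0, 0, 0), Sigma = Sigma))
names(sim) <- c("X", "C", "Z")
sim$W <- pmin(sim$X, sim$C)          # observed event time
sim$D <- as.integer(sim$X <= sim$C)  # 1 = observed, 0 = right-censored

precision <- solve(R)
precision[1, 2]    # numerically zero: no partial correlation of (X, C) given Z
mean(sim$D == 0)   # close to 0.5, since X - C is normal with mean 0
```

Setting `r_xc` to any other admissible value would give a dependent right-censoring analogue in this toy setup.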

In the following example, we simulate a sample of size 1,000 under
independent covariate right-censoring (`dep_censoring = FALSE`). As
noted in the **Supplementary Material Section S.5** to the paper
“**Establishing the Parallels and Differences Between Right-Censored
and Missing Covariates**”, the oracle mean and standard deviation of the
distribution of \(X\) are 0 and 1. For \(C\), these values are 0 and 2.

```
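# load the packages used in this section
library(dplyr)      # %>% and select()
library(ggplot2)    # plotting
library(rmarkdown)  # paged_table()
# data_mvn() is assumed to be already loaded from the tutorial's GitHub repository

# generate data under independent covariate right-censoring
set.seed(0)
n <- 1000
dep_censoring. <- FALSE
dat <- data_mvn(nSubjects = n, dep_censoring = dep_censoring.)
# preview the first five rows
head(dat, 5) %>% paged_table()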
```

Under independent covariate right-censoring we assume that \(X \perp C | Z\). We can check this assumption by comparing the marginal correlation of \((X,C)\) and the partial correlation given \(Z\).

```
# check bivariate correlations
marginal_cor = dat %>% select(X,C,Z) %>% cor() %>% round(2)
marginal_cor
```

```
##      X    C    Z
## X 1.00 0.12 0.55
## C 0.12 1.00 0.26
## Z 0.55 0.26 1.00
```

```
# check the partial correlation between (X,C) given Z
partial_cor = dat %>% select(X,C,Z) %>% ppcor::pcor()
partial_cor$estimate %>% round(2)
```

```
##       X     C    Z
## X  1.00 -0.03 0.55
## C -0.03  1.00 0.23
## Z  0.55  0.23 1.00
```
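As a sanity check, the partial correlation of \((X,C)\) given \(Z\) can be recovered by hand from the marginal correlations with the standard first-order formula \(r_{XC \cdot Z} = (r_{XC} - r_{XZ} r_{CZ}) / \sqrt{(1 - r_{XZ}^2)(1 - r_{CZ}^2)}\). The sketch below plugs in the rounded marginal correlations printed above; the variable names are illustrative.

```r
# Marginal correlations reported above (rounded to two decimals)
r_xc <- 0.12
r_xz <- 0.55
r_cz <- 0.26

# First-order partial correlation of (X, C) given Z
r_xc_given_z <- (r_xc - r_xz * r_cz) / sqrt((1 - r_xz^2) * (1 - r_cz^2))
round(r_xc_given_z, 2)
# → -0.03
```

The hand calculation agrees with the value returned by `ppcor::pcor()`.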

As expected, the marginal correlation between \(X\) and \(C\) is 0.12, but only -0.03 once we condition on \(Z\). This confirms that, in our simulation setup, \(X\) and \(C\) are independent only conditionally on \(Z\). Moreover, under this simulation scenario the expected right-censoring rate is 50%, and in this particular simulated data set the observed right-censoring rate is 52%. The distribution of \((X,C)\) is illustrated graphically below:

```
dat %>%
  mutate(D = factor(D, levels = c("0", "1"), labels = c("Right-censored", "Observed"))) %>%
  ggplot(aes(x = X, y = C, colour = D)) +
  geom_point() +
  theme_minimal() +
  geom_abline(intercept = 0, slope = 1) +
  # Shaded area above the diagonal line X = C
  geom_polygon(data = data.frame(x = c(-5, -5, 5), y = c(-5, 5, 5)),
               aes(x = x, y = y), fill = "#00BFC4", alpha = 0.1, inherit.aes = FALSE) +
  # Shaded area below the diagonal line X = C
  geom_polygon(data = data.frame(x = c(-5, 5, 5), y = c(-5, -5, 5)),
               aes(x = x, y = y), fill = "#F8766D", alpha = 0.1, inherit.aes = FALSE) +
  # add annotations for mean and sd
  # annotate("text", x = 3, y = -4,
  #          label = paste0("Mean X: ", round(mean(dat$X), 2), "\nSD X: ", round(var(dat$X)^0.5, 2)),
  #          hjust = 0, size = 4, color = "black") +
  # annotate("text", x = -4.5, y = 4,
  #          label = paste0("Mean C: ", round(mean(dat$C), 2), "\nSD C: ", round(var(dat$C)^0.5, 2)),
  #          hjust = 0, size = 4, color = "black") +
  ylim(-5, 5) + xlim(-5, 5) +
  labs(title = "Scatterplot of X vs. C", x = "X", y = "C", colour = "") +
  theme(legend.position = "bottom")
```