1 Introduction

In this tutorial, you will learn how, in the presence of a right-censored covariate, we can simulate data and estimate the linear regression parameters as discussed in the paper “Establishing the Parallels and Differences Between Right-Censored and Missing Covariates”, which can be found here: Link. More specifically, in this tutorial, you will learn how to use the data_mvn() function to generate data and how to estimate the parameters of interest using the complete case (CC), inverse probability weighting (IPW), maximum likelihood (MLE), augmented CC (ACC), modified ACC (MACC), and/or augmented IPW (AIPW) estimators. The GitHub repository for this tutorial can be accessed at GitHub.

Throughout, we are interested in the linear regression model

\[ y_i = \beta_0 + \beta_1 (A_i-X_i) + \beta_2 Z_i + \epsilon_i \]

where \(y_i\) represents the outcome, \(A_i\) the current age, \(X_i\) the age at diagnosis, \(A_i - X_i\) the time to clinical diagnosis, \(Z_i\) the fully observed covariate, and \(\epsilon_i \sim N(0,\sigma^2)\). The data also provide \(W_i = \min(X_i,C_i)\), where \(C_i\) is the random right-censoring variable and \(W_i\) is the observed event time. The indicator \(D_i = I(X_i \leq C_i)\) (denoted \(\Delta_i\) in the observed data) equals \(1\) if the observed event time equals the age at clinical diagnosis, and \(0\) otherwise. The goal is to estimate \(\theta = (\beta_0, \beta_1, \beta_2, \sigma)\) using the observed data \(O = (Y,W,\Delta,Z)\). Please refer to Supplementary Material Section S.5 for the exact details of the data simulation process, and to the main paper for more details on the estimators and their robustness and efficiency properties.
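To make the censoring mechanism concrete, here is a minimal toy sketch of how \(W\) and \(D\) are constructed from \(X\) and \(C\). It is illustrative only, with hypothetical X_toy and C_toy draws; it is not the data_mvn() implementation used below.

# toy illustration only: construct W = min(X, C) and D = I(X <= C)
set.seed(1)
X_toy = rnorm(5, mean = 0, sd = 1)   # event times (e.g., age at diagnosis)
C_toy = rnorm(5, mean = 0, sd = 2)   # right-censoring times
data.frame(X = X_toy, C = C_toy,
           W = pmin(X_toy, C_toy),          # observed event time
           D = as.integer(X_toy <= C_toy))  # 1 = X observed, 0 = right-censored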

This vignette is divided into two main sections, each with two subsections. First, we will guide you through the process when the nuisance parameters are known, covering both independent and dependent covariate right-censoring. We will then repeat the process for when the nuisance parameters need to be estimated. Throughout, we assume that \(Y \perp C | X,Z\). The structure of this vignette is as follows:

  • Known nuisance parameters

    • Independent covariate right-censoring (\(X \perp C | Z\))

    • Dependent covariate right-censoring (\(X \not\perp C | Z\))

  • Estimated nuisance parameters

    • Independent covariate right-censoring (\(X \perp C | Z\))

    • Dependent covariate right-censoring (\(X \not\perp C | Z\))

2 Known nuisance parameters

2.1 Independent covariate right-censoring

2.1.1 Generate data

The data_mvn() function generates data from a trivariate normal distribution, which is used to generate data under either independent or dependent covariate right-censoring. The inputs of the function are as follows:

  • nSubjects = Integer; the total number of observations.
  • dep_censoring = TRUE or FALSE; if FALSE, the partial correlation of \((X,C)\) given \(Z\) equals zero, and if TRUE, this partial correlation is not equal to zero.

In the following example, we simulate a sample of size 1,000 under independent covariate right-censoring. As noted in Supplementary Material Section S.5 of the paper “Establishing the Parallels and Differences Between Right-Censored and Missing Covariates”, the oracle mean and standard deviation of the distribution of \(X\) are 0 and 1; for \(C\), these values are 0 and 2. We verify these values empirically right after generating the data.

# generate data under independent covariate right censoring
set.seed(0)
n=1000
dep_censoring. = FALSE
dat = data_mvn(nSubjects = n, dep_censoring = dep_censoring.)

# visualize data
head(dat, 5) %>% paged_table()
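As a quick check of the oracle moments quoted above (mean 0 and SD 1 for \(X\); mean 0 and SD 2 for \(C\)), the empirical moments of the simulated columns can be computed as follows (this assumes dat contains the X and C columns displayed above):

# empirical moments of X and C; should be close to (0, 1) and (0, 2)
dat %>%
  summarise(mean_X = mean(X), sd_X = sd(X),
            mean_C = mean(C), sd_C = sd(C)) %>%
  round(2)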

Under independent covariate right-censoring we assume that \(X \perp C | Z\). We can check this assumption by comparing the marginal correlation of \((X,C)\) and the partial correlation given \(Z\).

# check bivariate correlations 
marginal_cor = dat %>% select(X,C,Z) %>% cor() %>% round(2)
marginal_cor
##      X    C    Z
## X 1.00 0.12 0.55
## C 0.12 1.00 0.26
## Z 0.55 0.26 1.00
# check the partial correlation between (X,C) given Z
partial_cor = dat %>% select(X,C,Z) %>% ppcor::pcor() 
partial_cor$estimate %>% round(2)
##       X     C    Z
## X  1.00 -0.03 0.55
## C -0.03  1.00 0.23
## Z  0.55  0.23 1.00

As expected, the marginal correlation between \((X,C)\) was 0.12, but only -0.03 after conditioning on \(Z\). This confirms that in our simulation setup, \(X\) and \(C\) are independent only conditionally on \(Z\). Moreover, under this simulation scenario the expected right-censoring rate is 50%, and in this particular simulation the right-censoring rate is 0.52 (a quick check is given below). The distribution of \((X,C)\) is then graphically illustrated in the scatterplot that follows:
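A quick check of the right-censoring rate quoted above (this assumes, as in the factor labels used throughout, that D = 1 flags an observed \(X\) and D = 0 a right-censored observation):

# observed right-censoring rate; expected to be close to 0.50
mean(dat$D == 0)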

dat %>%
  mutate(D = factor(D, levels = c("0", "1"), labels = c("Right-censored", "Observed"))) %>%
  ggplot(aes(x = X, y = C, colour = D)) + geom_point() +
  theme_minimal() +
  geom_abline(intercept = 0, slope = 1) +
  # Shaded area above the diagonal line X = C
  geom_polygon(data = data.frame(x = c(-5, -5, 5), y = c(-5, 5, 5)),
               aes(x = x, y = y), fill = "#00BFC4", alpha = 0.1, inherit.aes = FALSE) + 
  # Shaded area below the diagonal line X = C
  geom_polygon(data = data.frame(x = c(-5, 5, 5), y = c(-5, -5, 5)),
               aes(x = x, y = y), fill = "#F8766D", alpha = 0.1, inherit.aes = FALSE) +
  # add annotations for mean and sd
  # annotate("text", x = 3, y = -4, 
  #          label = paste0("Mean X: ", round(mean(dat$X), 2), "\nSD X: ", round(var(dat$X)^0.5, 2)), 
  #          hjust = 0, size = 4, color = "black") +
  # annotate("text", x = -4.5, y = 4, 
  #          label = paste0("Mean C: ", round(mean(dat$C), 2), "\nSD C: ", round(var(dat$C)^0.5, 2)), 
  #          hjust = 0, size = 4, color = "black") +
  ylim(-5, 5) + xlim(-5, 5) +
  labs(title = "Scatterplot of X vs. C", x = "X", y = "C", colour = "") +
  theme(legend.position = "bottom")

The values above the diagonal line correspond to \(X \leq C\), while values below the line correspond to \(C < X\). The blue shaded area represents the cases where we observe the value of \(X\), indicating \(C > X\). On the other hand, the red shaded area represents the right-censored observations, where the observed values are those of \(C\), indicating \(C < X\). We also plot the oracle probability of observing \(X\), defined by \(\pi_{X,Z}(w,z)\). These probability values become smaller as \(X\) increases, since a larger value of \(X\) corresponds to a higher probability that \(X\) is right-censored (see the scatterplot below).

# scatter plot for pi(w,z)
dat %>%
  mutate(D = factor(D, levels = c("0", "1"), labels = c("Right-censored", "Observed"))) %>% 
  ggplot(aes(x= W ,y=myp_xz, colour =D)) + geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatter plot of \u03C0(w,z) vs. W", y = "\u03C0(w,z)", x = "W = min(X,C)", colour = "") +
  theme(legend.position="bottom")
## `geom_smooth()` using formula = 'y ~ x'

Additionally, we can plot the probability of \(X\) being observed as a function of \((Y,Z)\), i.e., \(\pi_{Y,Z}(y,z)\). The values of the probability \(\pi_{Y,Z}(y,z)\) are similar to those defined by \((W,Z)\); in fact, the correlation between these values is 0.93. Thus, while not identical, these probability values are close.
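The 0.93 correlation quoted above can be reproduced directly from the oracle probability columns used in the plots (myp_xz and myp_yz):

# correlation between the oracle probabilities pi(w,z) and pi(y,z)
cor(dat$myp_xz, dat$myp_yz) %>% round(2)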

# scatter plot for pi(y,z)
dat %>%
  mutate(D = factor(D, levels = c("0", "1"), labels = c("Right-censored", "Observed"))) %>%
  ggplot(aes(x= myp_xz ,y=myp_yz, colour =D)) +
  geom_point() + theme_minimal() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatter plot of \u03C0(w,z) vs. \u03C0(y,z)", y = "\u03C0(y,z)", x = "\u03C0(w,z)",  colour = "") +
  theme(legend.position="bottom")
## `geom_smooth()` using formula = 'y ~ x'

2.1.2 Estimate parameters

Now that we have generated the data, we will estimate the linear regression parameters \(\theta\) and their corresponding standard error estimates. Before doing so, we draw probabilities from a uniform distribution, which serve as misspecified versions of the probability of \(X\) being observed for the IPW, ACC, MACC, and AIPW estimators. Estimates from one simulation can be obtained using the following code:

tic()
##################################
# uniform (0.1, 0.9) distribution
dat$myp_uniform = runif(n, min=0.1, max = 0.9)

##################################
# Oracle
est00 = estimate_beta(data_yXZ = dat, model = y ~ AX + Z) # oracle

# Naive
est10 = estimate_beta(data_yXZ = dat, model = y ~ AW + Z) # naive

# CC
est20 = estimate_beta_cc(data_yXZ = dat, model = y ~ AW + Z) # CC

# IPW
est30 = estimate_beta_ipw(data_yXZ = dat, model = y ~ AW+Z, myweight = "oracle") # oracle weight pi(w,z)
est31 = estimate_beta_ipw(data_yXZ = dat, model = y ~ AW+Z, myweight = "yz") # oracle weight pi(y,z)
est32 = estimate_beta_ipw(data_yXZ = dat, model = y ~ AW+Z, myweight = "uniform") # random weights

# MLE
est40 = estimate_beta_likelihood(dat, y ~ AW+Z, eta1= FALSE, dep_censoring = dep_censoring.) # correct specification of f(X|Z)
est41 = estimate_beta_likelihood(dat, y ~ AW+Z, eta1= TRUE, dep_censoring = dep_censoring.) # incorrect specification of f(X|Z)

# ACC
est50 = estimate_beta_acc(dat, y ~ AW+Z, eta1= FALSE, myweight = "oracle", dep_censoring = dep_censoring.)
est51 = estimate_beta_acc(dat, y ~ AW+Z, eta1= TRUE, myweight = "oracle", dep_censoring = dep_censoring.)
est52 = estimate_beta_acc(dat, y ~ AW+Z, eta1= FALSE, myweight = "uniform", dep_censoring = dep_censoring.)
est53 = estimate_beta_acc(dat, y ~ AW+Z, eta1= TRUE, myweight = "uniform", dep_censoring = dep_censoring.)
est54 = estimate_beta_acc_lambda_close(dat, y ~ AW+Z, myweight = "oracle")
est55 = estimate_beta_acc_lambda_close(dat, y ~ AW+Z, myweight = "uniform")

# MACC
est60 = estimate_beta_macc(dat, y ~ AW+Z, eta1= FALSE, myweight = "oracle", dep_censoring = dep_censoring.)
est61 = estimate_beta_macc(dat, y ~ AW+Z, eta1= TRUE, myweight = "oracle", dep_censoring = dep_censoring.)
est62 = estimate_beta_macc(dat, y ~ AW+Z, eta1= FALSE, myweight = "uniform", dep_censoring = dep_censoring.)
est63 = estimate_beta_macc(dat, y ~ AW+Z, eta1= TRUE, myweight = "uniform", dep_censoring = dep_censoring.)
est64 = estimate_beta_macc_lambda_close(dat, y ~ AW+Z, myweight = "oracle")
est65 = estimate_beta_macc_lambda_close(dat, y ~ AW+Z, myweight = "uniform")

# AIPW
est70 = estimate_beta_aipw(dat, y ~ AW+Z, eta1= FALSE, myweight = "oracle", dep_censoring = dep_censoring.)
est71 = estimate_beta_aipw(dat, y ~ AW+Z, eta1= TRUE, myweight = "oracle", dep_censoring = dep_censoring.)
est72 = estimate_beta_aipw(dat, y ~ AW+Z, eta1= FALSE, myweight = "uniform", dep_censoring = dep_censoring.)
est73 = estimate_beta_aipw(dat, y ~ AW+Z, eta1= TRUE, myweight = "uniform", dep_censoring = dep_censoring.)
est74 = estimate_beta_aipw_lambda_close(dat, y ~ AW+Z, myweight = "oracle")
est75 = estimate_beta_aipw_lambda_close(dat, y ~ AW+Z, myweight = "uniform")

toc()

# combine results

myreturn = data.frame(cbind(c("oracle","naive", "cc", 
                              "ipw-wz", "ipw-yz", "ipw-w", 
                              "acc", "acc-psi", "acc-w", "acc-psi-w", "acc-lambda", "acc-lambda-w", 
                              "macc", "macc-psi", "macc-w", "macc-psi-w", "macc-lambda", "macc-lambda-w", 
                              "aipw", "aipw-psi", "aipw-w", "aipw-psi-w", "aipw-lambda", "aipw-lambda-w"),
                            # beta estimates + sigma
                            rbind(est00$beta_est,est10$beta_est,est20$beta_est,
                                  est30$beta_est, est31$beta_est, est32$beta_est, 
                                  est50$beta_est, est51$beta_est, est52$beta_est, est53$beta_est, est54$beta_est, est55$beta_est,
                                  est60$beta_est, est61$beta_est, est62$beta_est, est63$beta_est, est64$beta_est, est65$beta_est,
                                  est70$beta_est, est71$beta_est, est72$beta_est, est73$beta_est, est74$beta_est, est75$beta_est),
                            # standard error estimates
                            rbind(est00$se_est,est10$se_est,est20$se_est,
                                  est30$se_est,est31$se_est,est32$se_est, 
                                  est50$se_est, est51$se_est, est52$se_est, est53$se_est, est54$se_est, est55$se_est,
                                  est60$se_est, est61$se_est, est62$se_est, est63$se_est, est64$se_est, est65$se_est,
                                  est70$se_est, est71$se_est, est72$se_est, est73$se_est, est74$se_est, est75$se_est)
  ))
colnames(myreturn) = c("Method",
                        "beta0.x", "beta1.x", "beta2.x","sigma.x",
                        "beta0.y", "beta1.y", "beta2.y")
  

myreturn_mle = data.frame(cbind(c("mle","mle-fxz"),
                            rbind(est40$beta_est, est41$beta_est),
                            rbind(est40$se_est, est41$se_est)))
colnames(myreturn_mle) = c("Method",
                        "beta0.x", "beta1.x", "beta2.x","sigma.x",
                        "beta0.y", "beta1.y", "beta2.y") 

myreturn = rbind(myreturn, myreturn_mle)
paged_table(myreturn)

Alternatively, we can use the function simcalc() to calculate one simulation. The inputs of the function are:

  • sim_i = Integer; this value sets the seed for the simulation.

  • dep_censoring. = TRUE or FALSE; if FALSE, the partial correlation of \((X,C)\) given \(Z\) equals zero, and if TRUE, this partial correlation is not equal to zero.

tic(); 
myresults = parallel::mclapply(1:3, function(x) simcalc(x, dep_censoring. = dep_censoring.))
toc()

Instead of parallel::mclapply(), one may also use lapply() to process the jobs; the benefit of parallel::mclapply() is simply that the jobs run in parallel and therefore finish faster. For the research paper, each simulation was submitted as a separate job to the UNC Longleaf cluster, so all jobs ran simultaneously; the results were then saved and analyzed individually. We have included the simulation results in this repository, and we now load them to replicate the results presented in the manuscript.
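For reference, here is a sketch of the serial alternative and of a call with an explicit core count for parallel::mclapply(); adjust mc.cores to your machine and uncomment to run:

# serial alternative (identical results, no parallelism)
# myresults = lapply(1:3, function(x) simcalc(x, dep_censoring. = dep_censoring.))

# parallel version with an explicit number of cores
# myresults = parallel::mclapply(1:3,
#                                function(x) simcalc(x, dep_censoring. = dep_censoring.),
#                                mc.cores = 2)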

2.1.3 Simulation study results

# helper function to read the saved simulation results
readmysims <-  function(list_est, size){
    myresults <- list_est %>% purrr::map_df(read.csv)
    myresults$n = size
    myresults$sim = unlist(lapply(1:length(list_est), function(x) rep(x, length(unique(myresults$Method)))))
    badlist = myresults %>%
              subset(Method %!in% c("naive")) %>%
               subset(is.na(beta0.x) | is.na(beta1.x) | is.na(beta2.x) | is.na(sigma.x) |
                      is.na(beta0.y) | is.na(beta1.y) | is.na(beta2.y) |
                       abs(beta0.x-theta[1])/theta[1] >= est_threshold |
                       abs(beta1.x-theta[2])/theta[2] >= est_threshold |
                       abs(beta2.x-theta[3])/theta[3] >= est_threshold |
                       abs(sigma.x-theta[4])/theta[4] >= est_threshold |
                       beta0.y > 1 | beta1.y > 1 | beta2.y > 1)
    myresults = myresults %>% subset(sim %!in% badlist$sim)
    return(myresults)
}
est_threshold = 3 # threshold used to exclude simulations, i.e., remove a simulation if the relative bias exceeds 300%
theta = c(1,2,3,1) # oracle values

# Load files from folder
# Sample size 100
list_est <- list.files(path = paste0("../01-Data/Known nuisance parameters/Independent/100-V3"), pattern = "est_results*", full.names = T)
myresults_a = readmysims(list_est, 100)

# Sample size 300
list_est <- list.files(path = paste0("../01-Data/Known nuisance parameters/Independent/300-V3"), pattern = "est_results*", full.names = T)
myresults_b = readmysims(list_est, 300)

# Sample size 1000
list_est <- list.files(path = paste0("../01-Data/Known nuisance parameters/Independent/1000-V3"), pattern = "est_results*", full.names = T)
myresults_c = readmysims(list_est, 1000)

# combine and format
myresults = rbind(myresults_a, myresults_b, myresults_c) %>%
            data.frame() %>%
             mutate(Method =
    factor(Method,
          levels = c("oracle", "cc", "naive",
                    "ipw", "ipw-w" , "ipw-hat", "ipw-yz",
                    "acc","acc-eta1", "acc-w","acc-eta1-w", "acc-lambda", "acc-lambda-hat", "acc-lambda-eta1", "acc-lambda-w",  "acc-lambda-eta1-w",
                    "macc","macc-eta1", "macc-w","macc-eta1-w", "macc-lambda", "macc-lambda-hat", "macc-lambda-eta1", "macc-lambda-w",  "macc-lambda-eta1-w",
                    "aipw", "aipw-eta1", "aipw-w", "aipw-eta1-w" ,"aipw-lambda", "aipw-lambda-hat","aipw-lambda-eta1", "aipw-lambda-w","aipw-lambda-eta1-w",

                    "mle", "mle-w"))) %>%
  mutate(Method.c = ifelse(Method %in% c("ipw", "ipw-w" , "ipw-hat", "ipw-yz"), "ipw",
                      ifelse(Method %in% c("aipw", "aipw-eta1", "aipw-w", "aipw-eta1-w" ,"aipw-lambda", "aipw-lambda-hat","aipw-lambda-eta1", "aipw-lambda-w","aipw-lambda-eta1-w"), "aipw",
                      ifelse(Method %in% c("acc","acc-eta1", "acc-w","acc-eta1-w", "acc-lambda", "acc-lambda-hat", "acc-lambda-eta1", "acc-lambda-w",  "acc-lambda-eta1-w"), "acc",
                      ifelse(Method %in% c("macc","macc-eta1", "macc-w","macc-eta1-w", "macc-lambda", "macc-lambda-hat", "macc-lambda-eta1", "macc-lambda-w",  "macc-lambda-eta1-w"), "macc",
                      ifelse(Method %in% c("mle","mle-w"), "mle",
                      ifelse(Method == "oracle", "oracle",
                      ifelse(Method == "cc", "cc", "naive")) )) )))) %>%
  mutate(Method.c = ifelse(Method == "naive", "naive", Method.c)) %>%
    mutate(Method.c = factor(Method.c, levels = c("oracle", "cc", "ipw", "acc", "macc", "aipw","mle") ))

After the jobs have been completed, we can compute the mean estimate of \(\theta\) (i.e., \(N^{-1}\sum_{i=1}^N\hat{\theta}_i\)) and its percent bias (i.e., \(N^{-1}\sum_{i=1}^N (\hat{\theta}_i-\theta_0)/\theta_0\)); the empirical standard deviation of \(\hat\theta\) across all simulations; and the empirical mean of the estimated standard error (i.e., \(N^{-1}\sum_{i=1}^{N} \hat{\rm SE}_i\)). Estimated standard errors were computed using the asymptotic variances derived from our theorems, with all expectations replaced by empirical averages. Lastly, we can calculate the empirical coverage of the estimated 95% confidence intervals.

myresults = myresults %>% 
  mutate(beta0.x = as.numeric(beta0.x),
         beta1.x = as.numeric(beta1.x),
         beta2.x = as.numeric(beta2.x),
         sigma.x = as.numeric(sigma.x),
         beta0.y = as.numeric(beta0.y),
         beta1.y = as.numeric(beta1.y),
         beta2.y = as.numeric(beta2.y)) %>%
  # keep only simulations whose estimates fall within the bias threshold
  filter(abs(beta0.x-theta[1])/theta[1] < est_threshold &
           abs(beta1.x-theta[2])/theta[2] < est_threshold &
           abs(beta2.x-theta[3])/theta[3] < est_threshold &
           abs(sigma.x-theta[4])/theta[4] < est_threshold) %>%
  na.omit()

# Calculate 95% confidence interval coverage indicators
myresults$beta0_coverage <- ((myresults$beta0.x + qnorm(0.975)*myresults$beta0.y) >= theta[1]) & 
  ((myresults$beta0.x - qnorm(0.975)*myresults$beta0.y) <= theta[1])
myresults$beta1_coverage <- ((myresults$beta1.x + qnorm(0.975)*myresults$beta1.y) >= theta[2]) & 
  ((myresults$beta1.x - qnorm(0.975)*myresults$beta1.y) <= theta[2])
myresults$beta2_coverage <- ((myresults$beta2.x + qnorm(0.975)*myresults$beta2.y) >= theta[3]) & 
  ((myresults$beta2.x - qnorm(0.975)*myresults$beta2.y) <= theta[3])

# mean estimates
myresults_est = myresults %>% 
  dplyr::select(Method,n,beta0.x, beta1.x, beta2.x, sigma.x) %>%
  group_by(Method, n) %>%
  summarise_all(function(x) mean(as.numeric(x)) %>% round(2))
myresults_est %>% paged_table()
# mean SE
myresults_se = myresults %>%
  dplyr::select(Method,n,beta0.y, beta1.y, beta2.y) %>%
  group_by(Method,n) %>%
  summarise_all(function(x) mean(as.numeric(x)*100) %>% round(2))
myresults_se  %>% paged_table()
# empirical SD
myresults_sehat = myresults %>%
  dplyr::select(Method,n,beta0.x, beta1.x, beta2.x, sigma.x) %>%
  group_by(Method,n) %>%
  summarise_all(function(x) round(100*var(as.numeric(x))^0.5, 5) %>% round(2))
myresults_sehat  %>% paged_table()
# coverage probabilities
myresults_coverage = myresults %>%
  dplyr::select(Method,n, beta0_coverage, beta1_coverage, beta2_coverage) %>%
  mutate(beta0_coverage = ifelse(beta0_coverage, 1, 0),
         beta1_coverage = ifelse(beta1_coverage, 1, 0),
         beta2_coverage = ifelse(beta2_coverage, 1, 0)) %>%
  group_by(Method,n) %>%
  summarise_all(function(x) round(mean(x)*100,2))
myresults_coverage  %>% paged_table()
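The percent bias described above is not tabulated in the chunks just shown; a minimal sketch that follows the same column conventions used in the filtering step (beta0.x compared with theta[1], beta1.x with theta[2], beta2.x with theta[3], and sigma.x with theta[4]) is:

# percent bias of the point estimates relative to the oracle values
myresults_bias = myresults %>%
  dplyr::select(Method, n, beta0.x, beta1.x, beta2.x, sigma.x) %>%
  group_by(Method, n) %>%
  summarise(beta0_bias = round(100*mean((beta0.x - theta[1])/theta[1]), 2),
            beta1_bias = round(100*mean((beta1.x - theta[2])/theta[2]), 2),
            beta2_bias = round(100*mean((beta2.x - theta[3])/theta[3]), 2),
            sigma_bias = round(100*mean((sigma.x - theta[4])/theta[4]), 2),
            .groups = "drop")
myresults_bias %>% paged_table()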

We can also recreate the figures, as presented in the Supplementary Material Section S.5.

## efficiency plots
# empirical SD
cc_sehat <- myresults %>%
  subset(Method %in% c("cc")) %>%
  dplyr::select(Method, n, beta0.x, beta1.x, beta2.x) %>%
  reshape2::melt(id.vars = c("Method", "n"), variable.name = c("Parameter")) %>%
  group_by(Method, n, Parameter) %>%
  summarise(se_cc = var(value)^0.5, .groups = "drop")
  
aug_sehat = myresults %>%
  subset(Method %in% c("acc","acc-lambda", "macc", "macc-lambda", "aipw", "aipw-lambda", "ipw", "mle")) %>%
  dplyr::select(Method, n, beta0.x, beta1.x, beta2.x) %>%
  reshape2::melt(id.vars = c("Method", "n"), variable.name = c("Parameter")) %>%
  group_by(Method, n, Parameter) %>%
  summarise(se_aug = var(value)^0.5, .groups = "drop")
  
emp_sehat = left_join(aug_sehat, cc_sehat, by = c("n", "Parameter")) %>%
  mutate(myratio = se_aug/se_cc) %>%
  mutate(type = "Empirical SD")



## mean SE  
cc_sehat <- myresults %>%
  subset(Method %in% c("cc")) %>%
  dplyr::select(Method, n, beta0.y, beta1.y, beta2.y) %>%
  reshape2::melt(id.vars = c("Method", "n"), variable.name = c("Parameter")) %>%
  group_by(Method, n, Parameter) %>%
  summarise(se_cc = mean(value), .groups = "drop")
  
aug_sehat = myresults %>%
  subset(Method %in% c("acc","acc-lambda", "macc", "macc-lambda", "aipw", "aipw-lambda", "ipw", "mle")) %>%
  dplyr::select(Method, n, beta0.y, beta1.y, beta2.y) %>%
  reshape2::melt(id.vars = c("Method", "n"), variable.name = c("Parameter")) %>%
  group_by(Method, n, Parameter) %>%
  summarise(se_aug = mean(value), .groups = "drop")
  
mean_sehat = left_join(aug_sehat, cc_sehat, by = c("n", "Parameter")) %>%
  mutate(myratio = se_aug/se_cc) %>%
  mutate(type = "Mean SE")

colnames(mean_sehat) = colnames(emp_sehat)  
myresults_sehat = rbind(mean_sehat, emp_sehat) %>%
  mutate(Method.x = factor(Method.x, levels = c("ipw","mle","acc", "acc-lambda", "macc", "macc-lambda" , "aipw", "aipw-lambda"))) %>%
  mutate(Parameter = ifelse(Parameter %in% c("beta0.y","beta0.x"), "beta0", ifelse(Parameter %in% c("beta1.y", "beta1.x"), "beta1", "beta2")))

p1 <- myresults %>%
  subset(Method.c %in% c("cc","acc", "macc", "ipw", "aipw", "mle")) %>%
  select(Method, Method.c, beta0.x, beta1.x, beta2.x, n) %>%
  reshape2::melt(id.vars = c("Method", "Method.c", "n"), variable.name = c("Parameter")) %>%
  mutate(Parameter = factor(Parameter, levels = c("beta0.x", "beta2.x", "beta1.x"))) %>%
  subset(Method != "mle-w" & n =="1000") %>%
  mutate(Method = factor(Method, labels = c("CC", "IPW", "IPW, incorrect \u03C0(w,z)", "IPW, correct \u03C0(y,z)", 
                                            "ACC", "ACC, incorrect \u03a8(y,z)", "ACC, incorrect \u03C0(y,z)", "ACC, incorrect \u03C0(y,z) and \u03a8(y,z)", "ACC with \u039b", "ACC with \u039b, incorrect \u03C0(y,z)",
                                            "MACC", "MACC, incorrect \u03a8(y,z)", "MACC, , incorrect \u03C0(w,z)", "MACC, incorrect \u03C0(w,z) and \u03a8(y,z)", "MACC with \u039b", "MACC with \u039b, incorrect \u03C0(w,z)",
                                            "AIPW", "AIPW, incorrect \u03a8(y,z)", "AIPW, , incorrect \u03C0(w,z)", "AIPW, incorrect \u03C0(w,z) and \u03a8(y,z)", "AIPW with \u039b", "AIPW with \u039b, incorrect \u03C0(w,z)", "MLE"))) %>%
  ggplot(aes(y=value, x=Method, fill=Method.c)) +
  geom_boxplot(outlier.color = "black", outlier.size = 0.2, position=position_dodge(.9)) +
  stat_summary(fun=mean, colour="white", geom="point", shape=18, size=2, position=position_dodge(.9)) +
  facet_grid(scales = "free",
             #cols = vars(n), 
             rows = vars(Parameter), 
             labeller = labeller(Parameter = c(beta0.x = "Intercept: \u03B2\u2080=1", 
                                               beta2.x = "A-X: \u03B2\u2081=3", 
                                               beta1.x = "Z: \u03B2\u2082=2"), 
                                type = c("Emprical SE", "Mean SE"))) +  # or any other appropriate label for `type`
  geom_hline(data = . %>% filter(Parameter == "beta0.x"), aes(yintercept = 1), linetype = "dashed", color = "red") +
  geom_hline(data = . %>% filter(Parameter == "beta2.x"), aes(yintercept = 3), linetype = "dashed", color = "red") +
  geom_hline(data = . %>% filter(Parameter == "beta1.x"), aes(yintercept = 2), linetype = "dashed", color = "red") +
  theme_bw() + theme(legend.position="bottom") +
  theme(axis.text.x = element_text(angle = 70, vjust = 1, hjust=1, size = 10),
        legend.text.align = 0,
        axis.text.y = element_text(size=12)) +
  guides(fill=guide_legend(nrow=1)) +
  labs(x = "", y = TeX("Estimate"), fill = "Estimator", title = "Distribution of simulation results under independent covariate right-censoring") +
  scale_fill_manual(labels = c("CC", "IPW", "ACC", "MACC", "AIPW", "MLE"), 
                    values = c("#000000","#E69F00","#56B4E9","#009E73",
                               "#F0E442", "#0072B2")) 

p1

## uncomment to save plot  
# ggsave(filename = "independent-boxplots-combined.png",  width = 8, height = 9, dpi = 300, units = "in")
  
p2 <- myresults_sehat %>%
  mutate(type = factor(type, levels = c("Mean SE", "Empirical SD"))) %>%
  mutate(Parameter = factor(Parameter, levels = c("beta0", "beta2", "beta1"))) %>%
  ggplot(aes(x=n, y=myratio, color=Method.x)) +
  geom_point(size=2) +
  geom_line(linewidth = 1) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  facet_grid(cols = vars(type), 
             rows = vars(Parameter), 
             labeller = labeller(Parameter = c(beta0 = "Intercept: \u03B2\u2080", 
                                               beta2 = "A-X: \u03B2\u2081", 
                                               beta1 = "Z: \u03B2\u2082"), 
                                type = c("Emprical SE", "Mean SE"))) +  # or any other appropriate label for `type`
  theme_bw() +
  theme(legend.position = "bottom") +
labs(title = "Efficiency comparative of estimators with the CC estimator for \nindependent covariate right-censoring",
     y = expression("Efficiency Ratio (%) = " * SE[Estimator]/SE[CC] * " or " * SD[Estimator]/SD[CC]), 
     color = "Estimator", 
     x = "Sample size (n)") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_continuous(breaks = unique(myresults_sehat$n)) +
  guides(color = guide_legend(nrow = 2)) +
  scale_color_manual(labels = c("IPW", "MLE", "ACC", 
                                expression("ACC with " * Lambda), 
                                "MACC", 
                                expression("MACC with " * Lambda), 
                                "AIPW", 
                                expression("AIPW with " * Lambda)), 
                    values = c("#000000", "#E69F00", "#56B4E9", 
                               "#009E73", "#F0E442", "#0072B2", 
                               "#D55E00", "#CC79A7"))

p2

## Uncomment to save plot  
# ggsave(filename = "independent-efficiency.png",  width = 7, height = 9, dpi = 300, units = "in")

2.2 Dependent covariate right-censoring

2.2.1 Generate data

As for independent covariate right-censoring, we now show how to simulate data under dependent covariate right-censoring using the function data_mvn(). Instead of specifying dep_censoring = FALSE, we set it equal to TRUE, which makes the partial correlation of \((X,C)\) given \(Z\) nonzero. In the following subsections, we replicate the same analyses as in the independent case, but for dependent covariate right-censoring. As before, the oracle mean and standard deviation of the distribution of \(X\) are 0 and 1; for \(C\), these values are 0 and 2.

# generate data under dependent covariate right censoring
set.seed(0)
dep_censoring. = TRUE
dat = data_mvn(nSubjects = n, dep_censoring = dep_censoring.)

# visualize data
head(dat, 2) %>% paged_table()

Under dependent covariate right-censoring we assume that \(X \not\perp C | Z\). We can check this assumption by comparing the marginal correlation of \((X,C)\) and the partial correlation given \(Z\).

# check bivariate correlations 
dat %>% select(X,C,Z) %>% cor() %>% round(2)
##      X    C    Z
## X 1.00 0.28 0.55
## C 0.28 1.00 0.23
## Z 0.55 0.23 1.00
# check partial correlation between (X,C) given Z
partial_cor = dat %>% select(X,C,Z) %>% ppcor::pcor() 
partial_cor$estimate %>% round(2)
##      X    C    Z
## X 1.00 0.19 0.52
## C 0.19 1.00 0.10
## Z 0.52 0.10 1.00

Here, both the marginal and the partial correlation between \((X,C)\) are nonzero. Therefore, our simulation scenario allows us to consider the case where \(X\) and \(C\) are not conditionally independent given \(Z\). The censoring rate remains close to the 50% level (actual: 0.49). The distribution of \((X,C)\) is shown below:

dat %>%
  mutate(D = factor(D, levels = c("0", "1"), labels = c("Right-censored", "Observed"))) %>%
  ggplot(aes(x = X, y = C, colour = D)) + geom_point() +
  theme_minimal() +
  geom_abline(intercept = 0, slope = 1) +
  # Shaded area above the diagonal line X = C
  geom_polygon(data = data.frame(x = c(-5, -5, 5), y = c(-5, 5, 5)),
               aes(x = x, y = y), fill = "#00BFC4", alpha = 0.1, inherit.aes = FALSE) + 
  # Shaded area below the diagonal line X = C
  geom_polygon(data = data.frame(x = c(-5, 5, 5), y = c(-5, -5, 5)),
               aes(x = x, y = y), fill = "#F8766D", alpha = 0.1, inherit.aes = FALSE) +
  # add annotations for mean and sd
  annotate("text", x = 3, y = -4, 
           label = paste0("Mean X: ", round(mean(dat$X), 2), "\nSD X: ", round(var(dat$X)^0.5, 2)), 
           hjust = 0, size = 4, color = "black") +
  annotate("text", x = -4.5, y = 4, 
           label = paste0("Mean C: ", round(mean(dat$C), 2), "\nSD C: ", round(var(dat$C)^0.5, 2)), 
           hjust = 0, size = 4, color = "black") +
  ylim(-5, 5) + xlim(-5, 5) +
  labs(title = "Scatterplot of X vs. C", x = "X", y = "C", colour = "") +
  theme(legend.position = "bottom")

As under independent covariate right-censoring, we observe that the probabilities \(\pi_{X,Z}(w,z)\) become smaller as \(W\) increases.

# scatter plot for pi(w,z)
dat %>%
  mutate(D = factor(D, levels = c("0", "1"), labels = c("Right-censored", "Observed"))) %>% 
  ggplot(aes(x= W ,y=myp_xz, colour =D)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatter plot of \u03C0(w,z) vs. W", y = "\u03C0(w,z)", x = "W", colour = "") +
  theme(legend.position="bottom")
## `geom_smooth()` using formula = 'y ~ x'

The probability values defined by \((Y,Z)\) are similar to those defined by \((W,Z)\). In fact, the correlation between these values remains high under dependent covariate right-censoring as well (correlation: 0.93).

# scatter plot for pi(y,z)
dat %>%
  mutate(D = factor(D, levels = c("0", "1"), labels = c("Right-censored", "Observed"))) %>%
  ggplot(aes(x= myp_xz ,y=myp_yz, colour =D)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatter plot of \u03C0(w,z) vs. \u03C0(y,z)", y = "\u03C0(y,z)", x = "\u03C0(w,z)",  colour = "") +
  theme(legend.position="bottom")
## `geom_smooth()` using formula = 'y ~ x'

2.2.2 Estimate parameters

Now that we have generated the data, we will estimate the linear regression parameters \(\theta = (\beta_0, \beta_1, \beta_2, \sigma)\) and their corresponding standard error estimates. Estimates from one simulation can be obtained similarly using the following code. The only difference is that the “oracle” probability of observing \(X\) and the estimators now account for the conditional dependence between \(X\) and \(C\) given \(Z\):

tic()
##################################
# uniform (0.1, 0.9) distribution
dat$myp_uniform = runif(n, min=0.1, max = 0.9)

##################################
# Oracle
est00 = estimate_beta(data_yXZ = dat, model = y ~ AX + Z) # oracle

# Naive
est10 = estimate_beta(data_yXZ = dat, model = y ~ AW + Z) # naive

# CC
est20 = estimate_beta_cc(data_yXZ = dat, model = y ~ AW + Z) # CC

# IPW
est30 = estimate_beta_ipw(data_yXZ = dat, model = y ~ AW+Z, myweight = "oracle") # oracle weight pi(w,z)
est31 = estimate_beta_ipw(data_yXZ = dat, model = y ~ AW+Z, myweight = "yz") # oracle weight pi(y,z)
est32 = estimate_beta_ipw(data_yXZ = dat, model = y ~ AW+Z, myweight = "uniform") # random weights

# MLE
est40 = estimate_beta_likelihood(dat, y ~ AW+Z, eta1= FALSE, dep_censoring = dep_censoring.) # correct specification of f(X|Z)
est41 = estimate_beta_likelihood(dat, y ~ AW+Z, eta1= TRUE, dep_censoring = dep_censoring.) # incorrect specification of f(X|Z)

# ACC
est50 = estimate_beta_acc(dat, y ~ AW+Z, eta1= FALSE, myweight = "oracle", dep_censoring = dep_censoring.)
est51 = estimate_beta_acc(dat, y ~ AW+Z, eta1= TRUE, myweight = "oracle", dep_censoring = dep_censoring.)
est52 = estimate_beta_acc(dat, y ~ AW+Z, eta1= FALSE, myweight = "uniform", dep_censoring = dep_censoring.)
est53 = estimate_beta_acc(dat, y ~ AW+Z, eta1= TRUE, myweight = "uniform", dep_censoring = dep_censoring.)
est54 = estimate_beta_acc_lambda_close(dat, y ~ AW+Z, myweight = "oracle")
est55 = estimate_beta_acc_lambda_close(dat, y ~ AW+Z, myweight = "uniform")

# MACC
est60 = estimate_beta_macc(dat, y ~ AW+Z, eta1= FALSE, myweight = "oracle", dep_censoring = dep_censoring.)
est61 = estimate_beta_macc(dat, y ~ AW+Z, eta1= TRUE, myweight = "oracle", dep_censoring = dep_censoring.)
est62 = estimate_beta_macc(dat, y ~ AW+Z, eta1= FALSE, myweight = "uniform", dep_censoring = dep_censoring.)
est63 = estimate_beta_macc(dat, y ~ AW+Z, eta1= TRUE, myweight = "uniform", dep_censoring = dep_censoring.)
est64 = estimate_beta_macc_lambda_close(dat, y ~ AW+Z, myweight = "oracle")
est65 = estimate_beta_macc_lambda_close(dat, y ~ AW+Z, myweight = "uniform")

# AIPW
est70 = estimate_beta_aipw(dat, y ~ AW+Z, eta1= FALSE, myweight = "oracle", dep_censoring = dep_censoring.)
est71 = estimate_beta_aipw(dat, y ~ AW+Z, eta1= TRUE, myweight = "oracle", dep_censoring = dep_censoring.)
est72 = estimate_beta_aipw(dat, y ~ AW+Z, eta1= FALSE, myweight = "uniform", dep_censoring = dep_censoring.)
est73 = estimate_beta_aipw(dat, y ~ AW+Z, eta1= TRUE, myweight = "uniform", dep_censoring = dep_censoring.)
est74 = estimate_beta_aipw_lambda_close(dat, y ~ AW+Z, myweight = "oracle")
est75 = estimate_beta_aipw_lambda_close(dat, y ~ AW+Z, myweight = "uniform")

toc()

# combine results

myreturn = data.frame(cbind(c("oracle","naive", "cc", 
                              "ipw-wz", "ipw-yz", "ipw-w", 
                              "acc", "acc-psi", "acc-w", "acc-psi-w", "acc-lambda", "acc-lambda-w", 
                              "macc", "macc-psi", "macc-w", "macc-psi-w", "macc-lambda", "macc-lambda-w", 
                              "aipw", "aipw-psi", "aipw-w", "aipw-psi-w", "aipw-lambda", "aipw-lambda-w"),
                            # beta estimates + sigma
                            rbind(est00$beta_est,est10$beta_est,est20$beta_est,
                                  est30$beta_est, est31$beta_est, est32$beta_est, 
                                  est50$beta_est, est51$beta_est, est52$beta_est, est53$beta_est, est54$beta_est, est55$beta_est,
                                  est60$beta_est, est61$beta_est, est62$beta_est, est63$beta_est, est64$beta_est, est65$beta_est,
                                  est70$beta_est, est71$beta_est, est72$beta_est, est73$beta_est, est74$beta_est, est75$beta_est),
                            # standard error estimates
                            rbind(est00$se_est,est10$se_est,est20$se_est,
                                  est30$se_est,est31$se_est,est32$se_est, 
                                  est50$se_est, est51$se_est, est52$se_est, est53$se_est, est54$se_est, est55$se_est,
                                  est60$se_est, est61$se_est, est62$se_est, est63$se_est, est64$se_est, est65$se_est,
                                  est70$se_est, est71$se_est, est72$se_est, est73$se_est, est74$se_est, est75$se_est)
  ))
colnames(myreturn) = c("Method",
                        "beta0.x", "beta1.x", "beta2.x","sigma.x",
                        "beta0.y", "beta1.y", "beta2.y")
  

myreturn_mle = data.frame(cbind(c("mle","mle-fxz"),
                            rbind(est40$beta_est, est41$beta_est),
                            rbind(est40$se_est, est41$se_est)))
colnames(myreturn_mle) = c("Method",
                        "beta0.x", "beta1.x", "beta2.x","sigma.x",
                        "beta0.y", "beta1.y", "beta2.y") 

myreturn = rbind(myreturn, myreturn_mle)
paged_table(myreturn)

Alternatively, we can use the function simcalc() to run the simulations. Instead of dep_censoring. = FALSE, we set it equal to TRUE so that the partial correlation of \((X,C)\) given \(Z\) is nonzero.

tic(); 
myresults = parallel::mclapply(1:3, function(x) simcalc(x, dep_censoring. = dep_censoring.))
toc()

Similar to the independent covariate right-censoring case, we computed all simulations using the UNC Longleaf cluster. These results were then saved and uploaded to this repository. We now load these and replicate the results presented in the manuscript.

2.2.3 Simulation study results

## Load files from Dependent folder
# Sample size 100
list_est <- list.files(path = paste0("../01-Data/Known nuisance parameters/Dependent/100-V3"), pattern = "est_results*", full.names = T)
myresults_a = readmysims(list_est, 100)

# Sample size 300
list_est <- list.files(path = paste0("../01-Data/Known nuisance parameters/Dependent/300-V3"), pattern = "est_results*", full.names = T)
myresults_b = readmysims(list_est, 300)

# Sample size 1000
list_est <- list.files(path = paste0("../01-Data/Known nuisance parameters/Dependent/1000-V3"), pattern = "est_results*", full.names = T)
myresults_c = readmysims(list_est, 1000)

# combine and format
myresults = rbind(myresults_a, myresults_b, myresults_c) %>%
            data.frame() %>%
             mutate(Method =
    factor(Method,
          levels = c("oracle", "cc", "naive",
                    "ipw", "ipw-w" , "ipw-hat", "ipw-yz",
                    "acc","acc-eta1", "acc-w","acc-eta1-w", "acc-lambda", "acc-lambda-hat", "acc-lambda-eta1", "acc-lambda-w",  "acc-lambda-eta1-w",
                    "macc","macc-eta1", "macc-w","macc-eta1-w", "macc-lambda", "macc-lambda-hat", "macc-lambda-eta1", "macc-lambda-w",  "macc-lambda-eta1-w",
                    "aipw", "aipw-eta1", "aipw-w", "aipw-eta1-w" ,"aipw-lambda", "aipw-lambda-hat","aipw-lambda-eta1", "aipw-lambda-w","aipw-lambda-eta1-w",

                    "mle", "mle-w"))) %>%
  mutate(Method.c = ifelse(Method %in% c("ipw", "ipw-w" , "ipw-hat", "ipw-yz"), "ipw",
                      ifelse(Method %in% c("aipw", "aipw-eta1", "aipw-w", "aipw-eta1-w" ,"aipw-lambda", "aipw-lambda-hat","aipw-lambda-eta1", "aipw-lambda-w","aipw-lambda-eta1-w"), "aipw",
                      ifelse(Method %in% c("acc","acc-eta1", "acc-w","acc-eta1-w", "acc-lambda", "acc-lambda-hat", "acc-lambda-eta1", "acc-lambda-w",  "acc-lambda-eta1-w"), "acc",
                      ifelse(Method %in% c("macc","macc-eta1", "macc-w","macc-eta1-w", "macc-lambda", "macc-lambda-hat", "macc-lambda-eta1", "macc-lambda-w",  "macc-lambda-eta1-w"), "macc",
                      ifelse(Method %in% c("mle","mle-w"), "mle",
                      ifelse(Method == "oracle", "oracle",
                      ifelse(Method == "cc", "cc", "naive")) )) )))) %>%
  mutate(Method.c = ifelse(Method == "naive", "naive", Method.c)) %>%
    mutate(Method.c = factor(Method.c, levels = c("oracle", "cc", "ipw", "acc", "macc", "aipw","mle") ))

As before, we compute the mean estimate of \(\theta\) (i.e., \(N^{-1}\sum_{i=1}^N\hat{\theta}_i\)); the empirical standard deviation of \(\hat\theta\) across all simulations; the empirical mean of the estimated standard error (i.e., \(N^{-1}\sum_{i=1}^{N} \hat{\rm SE}_i\)); and the empirical coverage of the estimated 95% confidence intervals.

myresults = myresults %>% 
  mutate(beta0.x = as.numeric(beta0.x),
         beta1.x = as.numeric(beta1.x),
         beta2.x = as.numeric(beta2.x),
         sigma.x = as.numeric(sigma.x),
         beta0.y = as.numeric(beta0.y),
         beta1.y = as.numeric(beta1.y),
         beta2.y = as.numeric(beta2.y)) %>%
  # keep only simulations whose estimates fall within the bias threshold
  filter(abs(beta0.x-theta[1])/theta[1] < est_threshold &
           abs(beta1.x-theta[2])/theta[2] < est_threshold &
           abs(beta2.x-theta[3])/theta[3] < est_threshold &
           abs(sigma.x-theta[4])/theta[4] < est_threshold) %>%
  na.omit()

# Calculate 95% confidence interval coverage indicators
myresults$beta0_coverage <- ((myresults$beta0.x + qnorm(0.975)*myresults$beta0.y) >= theta[1]) & 
  ((myresults$beta0.x - qnorm(0.975)*myresults$beta0.y) <= theta[1])
myresults$beta1_coverage <- ((myresults$beta1.x + qnorm(0.975)*myresults$beta1.y) >= theta[2]) & 
  ((myresults$beta1.x - qnorm(0.975)*myresults$beta1.y) <= theta[2])
myresults$beta2_coverage <- ((myresults$beta2.x + qnorm(0.975)*myresults$beta2.y) >= theta[3]) & 
  ((myresults$beta2.x - qnorm(0.975)*myresults$beta2.y) <= theta[3])

# mean estimates
myresults_est = myresults %>% 
  dplyr::select(Method,n,beta0.x, beta1.x, beta2.x, sigma.x) %>%
  group_by(Method, n) %>%
  summarise_all(function(x) mean(as.numeric(x)) %>% round(2))
myresults_est %>% paged_table()
# mean SE
myresults_se = myresults %>%
  dplyr::select(Method,n,beta0.y, beta1.y, beta2.y) %>%
  group_by(Method,n) %>%
  summarise_all(function(x) mean(as.numeric(x)*100) %>% round(2))
myresults_se  %>% paged_table()
# empirical SD
myresults_sehat = myresults %>%
  dplyr::select(Method,n,beta0.x, beta1.x, beta2.x, sigma.x) %>%
  group_by(Method,n) %>%
  summarise_all(function(x) round(100*var(as.numeric(x))^0.5, 5) %>% round(2))
myresults_sehat  %>% paged_table()
# coverage probabilities
myresults_coverage = myresults %>%
  dplyr::select(Method,n, beta0_coverage, beta1_coverage, beta2_coverage) %>%
  mutate(beta0_coverage = ifelse(beta0_coverage, 1, 0),
         beta1_coverage = ifelse(beta1_coverage, 1, 0),
         beta2_coverage = ifelse(beta2_coverage, 1, 0)) %>%
  group_by(Method,n) %>%
  summarise_all(function(x) round(mean(x)*100,2))
myresults_coverage  %>% paged_table()

We can also recreate the figures, as presented in the Supplementary Material Section S.5.

## efficiency plots
# empirical SD
cc_sehat <- myresults %>%
  subset(Method %in% c("cc")) %>%
  dplyr::select(Method, n, beta0.x, beta1.x, beta2.x) %>%
  reshape2::melt(id.vars = c("Method", "n"), variable.name = c("Parameter")) %>%
  group_by(Method, n, Parameter) %>%
  summarise(se_cc = var(value)^0.5, .groups = "drop")
  
aug_sehat = myresults %>%
  subset(Method %in% c("acc","acc-lambda", "macc", "macc-lambda", "aipw", "aipw-lambda", "ipw", "mle")) %>%
  dplyr::select(Method, n, beta0.x, beta1.x, beta2.x) %>%
  reshape2::melt(id.vars = c("Method", "n"), variable.name = c("Parameter")) %>%
  group_by(Method, n, Parameter) %>%
  summarise(se_aug = var(value)^0.5, .groups = "drop")
  
emp_sehat = left_join(aug_sehat, cc_sehat, by = c("n", "Parameter")) %>%
  mutate(myratio = se_aug/se_cc) %>%
  mutate(type = "Empirical SE")



## mean SE  
cc_sehat <- myresults %>%
  subset(Method %in% c("cc")) %>%
  dplyr::select(Method, n, beta0.y, beta1.y, beta2.y) %>%
  reshape2::melt(id.vars = c("Method", "n"), variable.name = c("Parameter")) %>%
  group_by(Method, n, Parameter) %>%
  summarise(se_cc = mean(value), .groups = "drop")
  
aug_sehat = myresults %>%
  subset(Method %in% c("acc","acc-lambda", "macc", "macc-lambda", "aipw", "aipw-lambda", "ipw", "mle")) %>%
  dplyr::select(Method, n, beta0.y, beta1.y, beta2.y) %>%
  reshape2::melt(id.vars = c("Method", "n"), variable.name = c("Parameter")) %>%
  group_by(Method, n, Parameter) %>%
  summarise(se_aug = mean(value), .groups = "drop")
  
mean_sehat = left_join(aug_sehat, cc_sehat, by = c("n", "Parameter")) %>%
  mutate(myratio = se_aug/se_cc) %>%
  mutate(type = "Mean SE")

colnames(mean_sehat) = colnames(emp_sehat)  
myresults_sehat = rbind(mean_sehat, emp_sehat) %>%
  mutate(Method.x = factor(Method.x, levels = c("ipw","mle","acc", "acc-lambda", "macc", "macc-lambda" , "aipw", "aipw-lambda"))) %>%
  mutate(Parameter = ifelse(Parameter %in% c("beta0.y","beta0.x"), "beta0", ifelse(Parameter %in% c("beta1.y", "beta1.x"), "beta1", "beta2")))

p3 <- myresults %>%
  subset(Method.c %in% c("cc","acc", "macc", "ipw", "aipw", "mle")) %>%
  select(Method, Method.c, beta0.x, beta1.x, beta2.x, n) %>%
  reshape2::melt(id.vars = c("Method", "Method.c", "n"), variable.name = c("Parameter")) %>%
  mutate(Parameter = factor(Parameter, levels = c("beta0.x", "beta2.x", "beta1.x"))) %>%
  subset(Method != "mle-w" & n =="1000") %>%
  mutate(Method = factor(Method, labels = c("CC", "IPW", "IPW, incorrect \u03C0(w,z)", "IPW, correct \u03C0(y,z)", 
                                            "ACC", "ACC, incorrect \u03a8(y,z)", "ACC, incorrect \u03C0(y,z)", "ACC, incorrect \u03C0(y,z) and \u03a8(y,z)", "ACC with \u039b", "ACC with \u039b, incorrect \u03C0(y,z)",
                                            "MACC", "MACC, incorrect \u03a8(y,z)", "MACC, , incorrect \u03C0(w,z)", "MACC, incorrect \u03C0(w,z) and \u03a8(y,z)", "MACC with \u039b", "MACC with \u039b, incorrect \u03C0(w,z)",
                                            "AIPW", "AIPW, incorrect \u03a8(y,z)", "AIPW, , incorrect \u03C0(w,z)", "AIPW, incorrect \u03C0(w,z) and \u03a8(y,z)", "AIPW with \u039b", "AIPW with \u039b, incorrect \u03C0(w,z)", "MLE"))) %>%
  ggplot(aes(y=value, x=Method, fill=Method.c)) +
  geom_boxplot(outlier.color = "black", outlier.size = 0.2, position=position_dodge(.9)) +
  stat_summary(fun=mean, colour="white", geom="point", shape=18, size=2, position=position_dodge(.9)) +
  facet_grid(scales = "free",
             #cols = vars(n), 
             rows = vars(Parameter), 
             labeller = labeller(Parameter = c(beta0.x = "Intercept: \u03B2\u2080=1", 
                                               beta2.x = "A-X: \u03B2\u2081=3", 
                                               beta1.x = "Z: \u03B2\u2082=2"), 
                                type = c("Emprical SE", "Mean SE"))) +  # or any other appropriate label for `type`
  geom_hline(data = . %>% filter(Parameter == "beta0.x"), aes(yintercept = 1), linetype = "dashed", color = "red") +
  geom_hline(data = . %>% filter(Parameter == "beta2.x"), aes(yintercept = 3), linetype = "dashed", color = "red") +
  geom_hline(data = . %>% filter(Parameter == "beta1.x"), aes(yintercept = 2), linetype = "dashed", color = "red") +
  theme_bw() + theme(legend.position="bottom") +
  theme(axis.text.x = element_text(angle = 70, vjust = 1, hjust=1, size = 10),
        legend.text.align = 0,
        axis.text.y = element_text(size=12)) +
  guides(fill=guide_legend(nrow=1)) +
  labs(x = "", y = TeX("Estimate"), fill = "Estimator", title = "Distribution of simulation results under dependent covariate right-censoring") +
  scale_fill_manual(labels = c("CC", "IPW", "ACC", "MACC", "AIPW", "MLE"), 
                    values = c("#000000","#E69F00","#56B4E9","#009E73",
                               "#F0E442", "#0072B2")) 

p3

## Uncomment to save plot  
# ggsave(filename = "dependent-boxplots-combined.png",  width = 8, height = 9, dpi = 300, units = "in")
  
p4 <- myresults_sehat %>%
  mutate(Parameter = factor(Parameter, levels = c("beta0", "beta2", "beta1"))) %>%
  ggplot(aes(x=n, y=myratio, color=Method.x)) +
  geom_point(size=2) +
  geom_line(linewidth = 1) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  facet_grid(cols = vars(type), 
             rows = vars(Parameter), 
             labeller = labeller(Parameter = c(beta0 = "Intercept: \u03B2\u2080", 
                                               beta2 = "A-X: \u03B2\u2081", 
                                               beta1 = "Z: \u03B2\u2082"), 
                                type = c("Emprical SE", "Mean SE"))) +  # or any other appropriate label for `type`
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "Efficiency comparative of estimators with the CC estimator for \ndependent covariate right-censoring",
       y = expression("Efficiency Ratio (%) = " * SE[Method] / SE[CC]), 
       color = "Estimator", 
       x = "Sample size (n)") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_continuous(breaks = unique(myresults_sehat$n)) +
  guides(color = guide_legend(nrow = 2)) +
  scale_color_manual(labels = c("IPW", "MLE", "ACC", 
                                expression("ACC with " * Lambda), 
                                "MACC", 
                                expression("MACC with " * Lambda), 
                                "AIPW", 
                                expression("AIPW with " * Lambda)), 
                    values = c("#000000", "#E69F00", "#56B4E9", 
                               "#009E73", "#F0E442", "#0072B2", 
                               "#D55E00", "#CC79A7"))

p4

## Uncomment to save plot 
# ggsave(filename = "dependent-efficiency.png",  width = 7, height = 9, dpi = 300, units = "in")

3 Estimated nuisance parameters

3.1 Independent covariate right-censoring

3.1.1 Generate data

Up to this point, we have explained how to recreate the simulation results under the assumption that the nuisance distributions are known. Specifically:

  • For the IPW estimator, we assumed that the parameters governing the probabilities \(\pi_{X,Z}(w,z)\) were known.
  • For the ACC, MACC, and AIPW estimators, we assumed that the parameters governing the probabilities \(\pi_{Y,Z}(y,z)\) and \(\pi_{X,Z}(x,z)\), as well as the augmented components, were known.
  • For the MLE, we assumed that the density \(f_{X|Z}(x,z)\) was known.

However, in practice, these distributions are typically unknown and must be estimated. Using our data simulation strategy, we estimate them by finding the set of parameters \(\alpha = (\mu_{X,C|Z}, \sigma_{X|Z}, \sigma_{C|Z}, \sigma_{X,C|Z})\) that maximizes the likelihood implied by the conditional bivariate normal density \(f_{W,\Delta|Z}\). In other words, we find

\[\hat{\alpha} = \underset{\alpha}{\mathrm{argmax}} \sum_{i=1}^n\left\{\delta_i\log\int_{w_i}^{\infty}f_{C,X|Z}(c,w_i,z_i;\alpha)\,dc + (1-\delta_i)\log\int_{w_i}^{\infty}f_{C,X|Z}(w_i,x,z_i;\alpha)\,dx \right\}.\]

To estimate \(\alpha\), we define a helper function (as shown next) that finds the values of \(\alpha\) that maximize the above expression using optimx().

# generate data under independent covariate right censoring
set.seed(0)
dep_censoring. = FALSE
dat = data_mvn(nSubjects = n, dep_censoring = dep_censoring.)

logLik_noninf = function(myalpha, mydata){
    
    # density function
    cxz_likelihood = function(w,delta,z){
      
      # 1. build my own bivariate density f(x,c|z)
      mydmvnorm = function(xzc){
        
        # need these for integration components
        meanXZ = myalpha[1] + myalpha[2]*xzc[2]
        meanCZ = myalpha[3] + myalpha[4]*xzc[2]
        sdXZ = myalpha[5]
        sdCZ = myalpha[6]
        
        val = dnorm(xzc[1], mean = meanXZ, sd = sdXZ)* # X|Z
          dnorm(xzc[3], mean = meanCZ, sd = sdCZ) # C|Z (equal to C|X,Z under independent censoring)
        return(val)
      }
      
      # 2. Integrate components of X and C that are missing
      dmvnorm_delta1 = function(t) unlist(lapply(t, function(tt) mydmvnorm(c(w,z,tt))))
      dmvnorm_delta0 = function(t) unlist(lapply(t, function(tt) mydmvnorm(c(tt,z,w))))
      mydensity = ifelse(delta==1,
                         integrate(dmvnorm_delta1,lower = w, upper = Inf)$value,
                         integrate(dmvnorm_delta0,lower = w, upper = Inf)$value)
      return(mydensity)
    }
    
    # integrate the density function above
    myreturn = lapply(1:nrow(mydata),
                      function(tt) cxz_likelihood(mydata$W[tt], mydata$D[tt], mydata$Z[tt])) %>%
      unlist() %>% log() %>% sum()
    return(myreturn)
    
} 

# select values close to the oracle values: 
g1 = c(0,0.5,0,0.5,0.86,1.94) + 0.1

myoptim1 = optimx(
  par = g1, # initial values for the parameters.
  fn = function(x,X){logLik_noninf(myalpha = x,mydata = X)}, # log likelihood
  X = dat,
  method = "Nelder-Mead",
  control = list(
    trace = F, # higher number print more detailed output
    maximize = T, # default is to minimize
    abstol= 10^(-4)
  )
)
## Maximizing -- use negfn and neggr
# print estimates 
myalpha1 = unlist(myoptim1[1:6])
myalpha1
##           p1           p2           p3           p4           p5           p6 
## -0.004088643  0.569454518  0.080784448  0.527268469  0.819311337  1.955206211
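For reference, the estimates above can be placed next to the oracle nuisance values used to construct the starting values (i.e., g1 minus the 0.1 offset added above):

# compare estimated nuisance parameters with the oracle values
rbind(oracle = g1 - 0.1, estimated = myalpha1) %>% round(3)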

Using \(\hat\alpha\), we (i) compute the probability of \(X\) being observed via \(\pi_{X,Z}(w,z) = \int_{w \leq c} f_{C|Z}(c,z)\,dc\) and \(\pi_{Y,Z}(y,z) = \int \pi_{X,Z}(x,z) f_{X|Y,Z}(x,y,z)\,dx\); (ii) define the density \(f_{X|Z}\) used to calculate the augmented component of the ACC, MACC, and AIPW estimators; and (iii) define the density \(f_{X|Z}\) for the MLE. To evaluate the fit of \(\hat\alpha\), we compute the probabilities \(\hat{\pi}_{X,Z}(w,z)\) and plot them against the oracle values \(\pi_{X,Z}(w,z)\).

# calculate the probability of X being observed
myp = function(myalpha,data){
    meanCZ = myalpha[3] + myalpha[4]*dat$Z
    sdCZ = myalpha[6]
    myphat = pnorm(data$W, mean = meanCZ, sd = sdCZ, lower.tail = FALSE)
    return(myphat)
}

dat$myp_xz_mle = myp(myalpha1,dat)

# scatter plot of oracle vs estimated probabilities
dat %>%
  mutate(D = factor(D, levels = c("0", "1"), labels = c("Right-censored", "Observed"))) %>% 
  ggplot(aes(x= myp_xz_mle ,y=myp_xz, colour =D)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatter plot of oracle vs. estimated \u03C0(w,z) ", y = "oracle", x = "estimated", colour = "") +
  theme(legend.position="bottom")
## `geom_smooth()` using formula = 'y ~ x'

As shown in the plot above, the probability estimates for the cases when \(X\) is observed (i.e., the blue points) strongly resemble the oracle probabilities. In fact, the Pearson correlation between the oracle and the estimated probabilities when \(X\) is observed is equal to 0.9994 (a quick check is given below). We then investigate how well the fitted conditional distribution \(\hat{f}_{X|Z}(x,z)\) matches the oracle distribution \(f_{X|Z}\).
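The 0.9994 correlation quoted above can be reproduced as follows (assuming D = 1 flags an observed \(X\), and using the myp_xz and myp_xz_mle columns from the previous chunks):

# Pearson correlation between oracle and estimated pi(w,z) among observed X
with(subset(dat, D == 1), cor(myp_xz, myp_xz_mle)) %>% round(4)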

# oracle X|Z values
dat$meanXZ = 0 + 0.5/1*(dat$Z-0) # mu_x + (var_xz/var_z)*(data.$Z-mu_z)

#estimated X|Z values
dat$meanXZ_hat = myalpha1[1] + myalpha1[2]*dat$Z

# scatter plot of oracle vs estimated probabilities
dat %>%
  ggplot(aes(x= meanXZ_hat ,y=meanXZ))+
  geom_point() +
  labs(title = "Scatter plot of oracle vs. estimated \nvalues of X conditional on Z ", x = "Estimated", y = "Oracle", colour = "") +
  ylim(-2,2) + xlim(-2,2)

As plotted above, the estimated distribution of \(X\) conditional on \(Z\) closely matches the oracle values. After calculating the integral \(\pi_{Y,Z}(y,z) = \int \pi_{X,Z}(x,z) f_{X|Y,Z}(x,y,z)\,dx\), where \(\pi_{X,Z}(x,z)\) is the estimated probability from the previous step, we find that the estimated probabilities are closely and linearly associated with the oracle probabilities, as shown below.

myp_yz.b = function(myalpha, data){

  meanXZ = myalpha[1] + myalpha[2]*data$Z
  sdXZ = myalpha[5]  
  meanCZ = myalpha[3] + myalpha[4]*data$Z
  sdCZ = myalpha[6]

  integral_func_num.b = function(t){
    val = rep(NA, length(t))
    for (i in 1:length(t)){
      myp = pnorm(t[i], mean = meanCZ, sd = sdCZ, lower.tail = FALSE)
      val[i] = myp*
        dnorm(data$y -  cbind(1,data$Z,data$A - t[i])%*%theta[1:3], 0, sd = theta[4])*
        dnorm(t[i], mean = meanXZ, sd = sdXZ)
    }
    return(val)
  }
  myp_yz_num = integrate(integral_func_num.b, lower = -Inf, upper = Inf, 
                     rel.tol = 1e-7, abs.tol = 1e-7)$value
  
  integral_func_denom.b = function(t){
    val = rep(NA, length(t))
    for (i in 1:length(t)){
      val[i] = dnorm(data$y -  cbind(1,data$Z,data$A - t[i])%*%theta[1:3], 0, sd = theta[4])*
        dnorm(t[i], mean = meanXZ, sd = sdXZ)
    }
    return(val)
  }
  myp_yz_denom = integrate(integral_func_denom.b, lower = -Inf, upper = Inf, rel.tol = 1e-7, abs.tol = 1e-7)$value
  return(myp_yz_num/myp_yz_denom)
} 

dat$myp_yz_est = lapply(1:nrow(dat), function(t.) myp_yz.b(myalpha1,dat[t.,])) %>% unlist()

# scatter plot of oracle vs estimated probabilities
dat %>%
  ggplot(aes(x= myp_yz_est ,y=myp_yz))+
  geom_point() +
  labs(title = "Scatter plot of oracle vs. estimated probabilities \u03C0(y,z) ", x = "Estimated", y = "Oracle") +
  theme(legend.position="bottom")  

Therefore, we conclude that the estimated nuisance parameters \(\hat{\alpha}\) closely resemble the oracle values. These simulation results agree with our theoretical work showing that \(\hat\alpha\) is a consistent estimator of \(\alpha\).

3.1.2 Estimate parameters

Now that we have estimated the nuisance parameters, we compute \(\hat\theta\) using the estimated probability of observing \(X\) and the estimated density \(\hat{f}_{X|Z}(x,z)\). Additionally, we compute the updated standard error estimates.

# overwrite the probability columns with their estimated counterparts so that
# the estimators below use the estimated nuisance parameters
dat$myp_xz = dat$myp_xz_mle
dat$myp_yz = dat$myp_yz_est

tic()
##################################
# Oracle
est00 = estimate_beta(data_yXZ = dat, model = y ~ AX + Z) # oracle

# Naive
est10 = estimate_beta(data_yXZ = dat, model = y ~ AW + Z) # naive

# CC
est20 = estimate_beta_cc(data_yXZ = dat, model = y ~ AW + Z) # CC

# IPW
est30 = estimate_beta_ipw(data_yXZ = dat, model = y ~ AW+Z, myweight = "oracle") # oracle 
est30_updatedSE = var_beta_ipw(data_yXZ=dat, mytheta= est30$beta_est, myalpha = myalpha1)

# MLE
est40 = estimate_beta_likelihood_hat(data_yXZ = dat, model = y ~ AW+Z, myalpha = myalpha1) 
est40_updatedSE = var_beta_likelihood(data_yXZ=dat, mytheta= est40$beta_est, myalpha = myalpha1)

# MACC
est60 = estimate_beta_macc_lambda_close(dat, y ~ AW+Z, myweight = "oracle")
est60_updatedSE = var_beta_macc_lambda(data_yXZ=dat, mytheta= est60$beta_est, myalpha = myalpha1)
## [1] "A is done"
##             [,1]         [,2]        [,3]
## [1,] -0.00357703  0.001257981 -0.01667726
## [2,]  0.00476640  0.088825433 -0.04210634
## [3,]  0.02664510 -0.030940724  0.12708171
## [1] "bread"
##               [,1]          [,2]          [,3]
## [1,] -2.214244e-03  0.0004207112  6.105457e-05
## [2,]  4.119886e-04 -0.0014293925 -7.075807e-04
## [3,]  7.180359e-05 -0.0007264014 -2.561271e-03
## [1] "meat"
##           [,1]      [,2]       [,3]
## [1,] 479.22138  149.3080  -14.20005
## [2,] 149.30803  904.9115 -259.40611
## [3,] -14.20005 -259.4061  519.83050
# AIPW
est70 = estimate_beta_aipw_lambda_close(dat, y ~ AW+Z, myweight = "oracle")
est70_updatedSE = var_beta_aipw_lambda(data_yXZ=dat, mytheta= est70$beta_est, myalpha = myalpha1)
## [1] "A is done"
##             [,1]        [,2]        [,3]
## [1,] -0.05804621 -0.10664146 -0.17133321
## [2,]  0.21356589  0.26708639  0.09774166
## [3,] -0.11594077  0.04326503  0.51483234
toc()
## 711.306 sec elapsed

# combine results

myreturn = data.frame(cbind(c("oracle","naive", "cc", "ipw", "macc", "aipw"),
                            # beta estimates + sigma
                            rbind(est00$beta_est,est10$beta_est,est20$beta_est,
                                  est30$beta_est, est60$beta_est, est70$beta_est),
                            # standard error estimates
                            rbind(est00$se_est,est10$se_est,est20$se_est,
                                  est30$se_est, est60$se_est, est70$se_est),
                            # updated standard error estimates
                            rbind(est00$se_est,est10$se_est,est20$se_est,
                                  est30_updatedSE$se_updated, 
                                  est60_updatedSE$se_updated, 
                                  est70_updatedSE$se_updated)
  ))

colnames(myreturn) = c("Method",
                        "beta0.x", "beta1.x", "beta2.x","sigma.x",
                        "beta0.y", "beta1.y", "beta2.y",
                        "beta0.updated", "beta1.updated", "beta2.updated")
  

myreturn_mle = data.frame(cbind(c("mle"),
                            rbind(est40$beta_est),
                            rbind(est40$se_est),
                            rbind(est40_updatedSE$se_updated)))

colnames(myreturn_mle) = colnames(myreturn)

myreturn = rbind(myreturn, myreturn_mle)
paged_table(myreturn)

3.1.3 Simulation study results

We computed all simulations using the UNC Longleaf cluster. These results were then saved and uploaded to this repository. We now load these and replicate the results presented in the paper.

# Load files from Independent folder
# Sample size 1000
list_est <- list.files(path = paste0("../01-Data/Estimated nuisance parameters/Independent/1000-Lambda-hat"), pattern = "est_results*", full.names = T)
myresults_a = readmysims(list_est, 1000)
myresults_a = myresults_a %>%
  subset(Method %in% c("oracle","cc", "naive", "ipw-hat", "acc-lambda-hat", "aipw-lambda-hat")) %>%
  mutate(Method = ifelse(Method == "acc-lambda-hat", "macc-lambda", Method)) %>%
  mutate(Method = ifelse(Method == "aipw-lambda-hat", "aipw-lambda", Method)) %>%
  mutate(Method = ifelse(Method == "ipw-hat", "ipw", Method))
  
list_est <- list.files(path = paste0("../01-Data/Estimated nuisance parameters/Independent/1000-MLE-hat"), pattern = "est_results*", full.names = T)  
myresults_b = readmysims(list_est, 1000)
myresults_b = myresults_b %>%subset(Method == "mle")

# combine and format
myresults = rbind(myresults_a, myresults_b) %>%
             mutate(Method = factor(Method, levels = c("oracle", "cc", "naive", "ipw", "macc-lambda", "aipw-lambda", "mle")))

After importing these, we can compute the mean estimate of \(\theta\), the empirical standard deviation of \(\hat\theta\), the empirical mean of the estimated standard error, and the empirical coverage of the estimated 95% confidence intervals.

myresults = myresults %>% 
  mutate(beta0.x = as.numeric(beta0.x),
         beta1.x = as.numeric(beta1.x),
         beta2.x = as.numeric(beta2.x),
         sigma.x = as.numeric(sigma.x),
         beta0.y = as.numeric(beta0.y),
         beta1.y = as.numeric(beta1.y),
         beta2.y = as.numeric(beta2.y)) %>%
  # keep only simulations whose estimates fall within the relative error threshold (est_threshold) of the true values
  filter(abs(beta0.x-theta[1])/theta[1] < est_threshold &
           abs(beta1.x-theta[2])/theta[2] < est_threshold &
           abs(beta2.x-theta[3])/theta[3] < est_threshold &
           abs(sigma.x-theta[4])/theta[4] < est_threshold) %>%
  na.omit()

# Calculate 95% confidence interval coverage indicators
myresults$beta0_coverage <- ((myresults$beta0.x + qnorm(0.975)*myresults$beta0.y) >= theta[1]) & 
  ((myresults$beta0.x - qnorm(0.975)*myresults$beta0.y) <= theta[1])
myresults$beta1_coverage <- ((myresults$beta1.x + qnorm(0.975)*myresults$beta1.y) >= theta[2]) & 
  ((myresults$beta1.x - qnorm(0.975)*myresults$beta1.y) <= theta[2])
myresults$beta2_coverage <- ((myresults$beta2.x + qnorm(0.975)*myresults$beta2.y) >= theta[3]) & 
  ((myresults$beta2.x - qnorm(0.975)*myresults$beta2.y) <= theta[3])

# mean estimates
myresults_est = myresults %>% 
  dplyr::select(Method,n,beta0.x, beta1.x, beta2.x, sigma.x) %>%
  group_by(Method, n) %>%
  summarise_all(function(x) mean(as.numeric(x)) %>% round(2))
myresults_est %>% paged_table()
# mean SE
myresults_se = myresults %>%
  dplyr::select(Method,n,beta0.y, beta1.y, beta2.y) %>%
  group_by(Method,n) %>%
  summarise_all(function(x) mean(as.numeric(x)*100) %>% round(2))
myresults_se  %>% paged_table()
# empirical SE
myresults_sehat = myresults %>%
  dplyr::select(Method,n,beta0.x, beta1.x, beta2.x, sigma.x) %>%
  group_by(Method,n) %>%
  summarise_all(function(x) round(100*var(as.numeric(x))^0.5, 5) %>% round(2))
myresults_sehat  %>% paged_table()
# coverage probabilities
myresults_coverage = myresults %>%
  dplyr::select(Method,n, beta0_coverage, beta1_coverage, beta2_coverage) %>%
  mutate(beta0_coverage = ifelse(beta0_coverage, 1, 0),
         beta1_coverage = ifelse(beta1_coverage, 1, 0),
         beta2_coverage = ifelse(beta2_coverage, 1, 0)) %>%
  group_by(Method,n) %>%
  summarise_all(function(x) round(mean(x)*100,2))
myresults_coverage  %>% paged_table()

While not shown in the paper, we can also create a figure similar to those produced when \(\alpha\) was assumed known,

p5 <- myresults %>%
  subset(Method %in% c("cc", "macc-lambda", "ipw", "aipw-lambda", "mle")) %>%
  select(Method, beta0.x, beta1.x, beta2.x) %>%
  reshape2::melt(id.vars = c("Method"), variable.name = c("Parameter")) %>%
  mutate(Parameter = factor(Parameter, levels = c("beta0.x", "beta2.x", "beta1.x"))) %>%
  mutate(Method = factor(Method, labels = c("CC", "IPW", "MACC with \u039b", "AIPW with \u039b", "MLE"))) %>%
  ggplot(aes(y=value, x=Method, fill=Method)) +
  geom_boxplot(outlier.color = "black", outlier.size = 0.2, position=position_dodge(.9)) +
  stat_summary(fun=mean, colour="white", geom="point", shape=18, size=2, position=position_dodge(.9)) +
  facet_grid(scales = "free",
             rows = vars(Parameter), 
             labeller = labeller(Parameter = c(beta0.x = "Intercept: \u03B2\u2080=1", 
                                               beta2.x = "A-X: \u03B2\u2081=3", 
                                               beta1.x = "Z: \u03B2\u2082=2"))) +
  geom_hline(data = . %>% filter(Parameter == "beta0.x"), aes(yintercept = 1), linetype = "dashed", color = "red") +
  geom_hline(data = . %>% filter(Parameter == "beta2.x"), aes(yintercept = 3), linetype = "dashed", color = "red") +
  geom_hline(data = . %>% filter(Parameter == "beta1.x"), aes(yintercept = 2), linetype = "dashed", color = "red") +
  theme_bw() + theme(legend.position="bottom") +
  theme(axis.text.x = element_text(angle = 70, vjust = 1, hjust=1, size = 10),
        legend.text.align = 0,
        axis.text.y = element_text(size=12)) +
  guides(fill=guide_legend(nrow=1)) +
  labs(x = "", y = TeX("Estimate"), fill = "Estimator", title = "Distribution of simulation results under dependent covariate right-censoring") +
  scale_fill_manual(labels = c("CC", "IPW", "MACC", "AIPW", "MLE"), 
                    values = c("#000000","#E69F00","#009E73",
                               "#F0E442", "#0072B2")) 

p5

## Uncomment to save plot 
# ggsave(filename = "dependent-boxplots-combined-estimated.png",  width = 8, height = 9, dpi = 300, units = "in")

3.2 Dependent covariate right-censoring

3.2.1 Generate data

To address the dependent covariate right-censoring problem, we adopt the data simulation scenario from Bartlette et al. (2024). However, we modify the simulation so that, after generating the data with missing \(X\), we set \(C = X - \mathrm{Beta}\{\exp(Z), 1\}\) for the observations with incomplete data, where \(\mathrm{Beta}\{\exp(Z), 1\}\) denotes a draw from a Beta distribution with those parameters. As a result, the probability of observing \(X\) follows a logistic model of the form

\[\pi_{Y,Z}(y,z) = \Pr(\Delta = 1 \mid Y=y, Z=z) = \mathrm{expit}(\alpha_0 + \alpha_1 y + \alpha_2 z).\]

The function data_mvn_bartlette() is used to generate the data for this mechanism. Its only input is nSubjects, the number of observations. Next, we show how to generate the data using this function.

# generate data under dependent covariate right censoring
set.seed(0)
dat = data_mvn_bartlette(nSubjects = n)

head(dat) %>% paged_table()

We now estimate the logistic regression coefficients of the model above. The fitted coefficients suggest that higher values of \(Z\) and higher values of \(Y\) are associated with a higher probability of observing \(X\). The scatter plot below illustrates this relationship.

glm(formula = D ~ y + Z, family = binomial(link = "logit"), data = dat)
## 
## Call:  glm(formula = D ~ y + Z, family = binomial(link = "logit"), data = dat)
## 
## Coefficients:
## (Intercept)            y            Z  
##     -0.6160       0.1645       1.0266  
## 
## Degrees of Freedom: 999 Total (i.e. Null);  997 Residual
## Null Deviance:       1385 
## Residual Deviance: 1134  AIC: 1140
dat %>%
  mutate(D = factor(D, levels = c("0", "1"), labels = c("Right-censored", "Observed"))) %>% 
  ggplot(aes(x=Z ,y=y, colour =myp_yz)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter plot of Y vs. Z", y = "y", x = "Z", colour = "Probability \u03C0(y,z)") +
  theme(legend.position="bottom")
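
To make the link between this fitted logistic model and the probabilities used by the estimators explicit, the following sketch stores the fit (under a name of our choosing, fit_pi), extracts the fitted values \(\hat\pi_{Y,Z}(y,z)\), and compares them with the oracle column myp_yz generated by data_mvn_bartlette().

# refit and store the logistic model, then extract fitted probabilities of observing X
fit_pi = glm(D ~ y + Z, family = binomial(link = "logit"), data = dat)
dat$myp_yz_glm = predict(fit_pi, type = "response")

# agreement between the oracle and logistic-model probabilities
cor(dat$myp_yz, dat$myp_yz_glm)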

3.2.2 Estimate parameters

We can generate 3,000 data sets using the following code and save them in a single file, acc_example_stata.csv. Once the data are generated, we use the augcc command in Stata to compute the parameter estimates for the ACC estimator with \(\Lambda\).

#################################################################
# generate multiple datasets and save as "acc_example_stata.csv"
#################################################################
set.seed(0); dat = data_mvn(1000)
dat$sim = 0
for (i in 1:3000) {
  set.seed(i)
  dat_temp = data_mvn(1000); dat_temp$sim = i
  dat = rbind(dat, dat_temp)
}
dat = dat %>%
  rename(R=D) %>%
  mutate(Xmiss = ifelse(R==1, X, NA)) %>%
  select(y,X,Xmiss,Z,R,sim)
write.csv(dat, "acc_example_stata.csv", row.names = FALSE, na = "")

The following Stata code was used to compute the parameter estimates for the regression model. The command estimates the probabilities \(\pi_{Y,Z}(y,z)\) using logistic regression and uses these estimates to compute the ACC estimator with \(\Lambda\). This process was repeated for each of the simulated data sets.

// stata do file to estimate parameters of interest
clear
import delimited "acc_example_stata.csv"

keep if sim == `1' // keep only the simulation whose number is passed as the first argument to the do-file

// generate 30 imputations from a working f(X|Z), which can be any distribution
forvalues i = 1/30 {
    gen imp`i' = rnormal(0,1)
}


// correct weight 
augcca z, y(y) x(xmiss) imps(imp1-imp30) pivars(y z) 

// incorrect weight specification
augcca z, y(y) x(xmiss) imps(imp1-imp30) pivars(z)

3.2.3 Simulation study results

We computed all simulations using the UNC Longleaf cluster and uploaded the output to this repository. We now load these files and replicate the results presented in the paper.

## Misspecified model
mypath = paste0("../01-Data/Estimated nuisance parameters/Dependent/1000-ACC-STATA")
list_est <- list.files(path = mypath, pattern = "augacc_z_mysim*", full.names = T)
myresults <-  list_est %>% map_df(function(x) read_excel(x, col_names = FALSE))
myresults_est <- myresults[4*(1:length(list_est))-3, ]
myresults_se <- myresults[-(4*(1:length(list_est))-3), ]
myresults_se_temp <- matrix(nrow=1, ncol=3)
for (i in 1:length(list_est)){
    temp = myresults_se[(i*3-2):(i*3),]
    myresults_se_temp = rbind(myresults_se_temp, 
        cbind(as.numeric(temp[1,1]), as.numeric(temp[2,2]), as.numeric(temp[3,3])))
}
myresults_se = myresults_se_temp[-1,]^0.5
theta = c(0,0.2,0.2,0.95)
myresults_z = cbind(myresults_est, myresults_se)
myresults_z$Method = "acc-z" 

## Correct model
list_est <- list.files(path = mypath, pattern = "augacc_y z_mysim*", full.names = T)
myresults <-  list_est %>% map_df(function(x) read_excel(x, col_names = FALSE))
myresults_est <- myresults[4*(1:length(list_est))-3, ]
myresults_se <- myresults[-(4*(1:length(list_est))-3), ]
myresults_se_temp <- matrix(nrow=1, ncol=3)
for (i in 1:length(list_est)){
    temp = myresults_se[(i*3-2):(i*3),]
    myresults_se_temp = rbind(myresults_se_temp, 
        cbind(as.numeric(temp[1,1]), as.numeric(temp[2,2]), as.numeric(temp[3,3])))
}
myresults_se = myresults_se_temp[-1,]^0.5
theta = c(0,0.2,0.2,0.95)
myresults_yz = cbind(myresults_est, myresults_se)
myresults_yz$Method = "acc-yz" 

## combine both models
myresults = rbind(myresults_yz, myresults_z)
colnames(myresults) = c("beta1.x", "beta2.x", "beta0.x", "beta1.y", "beta2.y", "beta0.y", "Method")
# Calculate 95% confidence interval coverage indicators
myresults$beta0_coverage <- ((myresults$beta0.x + qnorm(0.975)*myresults$beta0.y) >= theta[1]) & 
  ((myresults$beta0.x - qnorm(0.975)*myresults$beta0.y) <= theta[1])
myresults$beta1_coverage <- ((myresults$beta1.x + qnorm(0.975)*myresults$beta1.y) >= theta[2]) & 
  ((myresults$beta1.x - qnorm(0.975)*myresults$beta1.y) <= theta[2])
myresults$beta2_coverage <- ((myresults$beta2.x + qnorm(0.975)*myresults$beta2.y) >= theta[3]) & 
  ((myresults$beta2.x - qnorm(0.975)*myresults$beta2.y) <= theta[3])

# mean estimates
myresults_est = myresults %>% 
  dplyr::select(Method,beta0.x, beta1.x, beta2.x) %>%
  group_by(Method) %>%
  summarise_all(function(x) mean(as.numeric(x)) %>% round(2))
myresults_est %>% paged_table()
# mean SE
myresults_se = myresults %>%
  dplyr::select(Method,beta0.y, beta1.y, beta2.y) %>%
  group_by(Method) %>%
  summarise_all(function(x) mean(as.numeric(x)*100) %>% round(2))
myresults_se  %>% paged_table()
# empirical SE
myresults_sehat = myresults %>%
  dplyr::select(Method,beta0.x, beta1.x, beta2.x) %>%
  group_by(Method) %>%
  summarise_all(function(x) round(100*var(as.numeric(x))^0.5, 5) %>% round(2))
myresults_sehat  %>% paged_table()
# coverage probabilities
myresults_coverage = myresults %>%
  dplyr::select(Method, beta0_coverage, beta1_coverage, beta2_coverage) %>%
  mutate(beta0_coverage = ifelse(beta0_coverage, 1, 0),
         beta1_coverage = ifelse(beta1_coverage, 1, 0),
         beta2_coverage = ifelse(beta2_coverage, 1, 0)) %>%
  group_by(Method) %>%
  summarise_all(function(x) round(mean(x)*100,2))
myresults_coverage  %>% paged_table()