Single Imputation vs Multivariate Normal Imputation
A Demonstration using a Bayesian Approach
Introduction
Recently, I have been revisiting my notes from the course on Applied Bayesian Methods that I took in grad school. There is a chapter in the book (see margin) on the multivariate normal model. One of its sections discusses missing data and how to impute it using the multivariate normal model. An exercise at the end of the chapter asks the reader to compare the means of two groups that contain missing data. First, the author suggests using regression imputation to fill in the missing values and then comparing the group means with a paired t-test. Next, he suggests imputing the missing data with a multivariate normal model and then comparing the means. The two methods lead to opposite conclusions: the t-test shows a significant result, but no significant difference is found when the multivariate normal model is used for imputation. I thought this was an excellent example of the difference between multiple imputation methods, which account for the uncertainty in the missing data, and single imputation methods like regression imputation, which do not.
In this blog, I pick another dataset, carry out a paired comparison using the t-test with regression imputation and with multivariate normal imputation, and discuss the results.
Dataset
I have used the Concrete Compressive Strength dataset available on the UC Irvine Machine Learning Repository website. The full dataset can be found here1.
1 Yeh, I. (1998). Concrete Compressive Strength. UCI Machine Learning Repository. https://doi.org/10.24432/C5PK67.
The concrete compressive strength (MPa) of each mixture was measured in the laboratory at a given age (days). We will compare the compressive strength on day 3 vs day 7. Because concrete gains strength as it cures, the day 7 strength is expected to exceed the day 3 strength with very high probability. So, instead of testing for a difference in means, we will test the hypothesis that the mean strength on day 7 exceeds the mean strength on day 3 by a certain margin (say 10 MPa).
A subset of the dataset with only the columns relevant to this analysis has been extracted. The table below shows the concrete strength at three different ages (in days) for each sample - day 3, day 7 and day 28. We will only compare day 3 with day 7.
| Age_3 | Age_7 | Age_28 |
|---|---|---|
| 34.39796 | 46.20179 | 61.09447 |
| 28.79941 | 42.79578 | 59.79825 |
| 33.39822 | 49.20101 | 60.29468 |
| 36.30091 | 46.80163 | 61.79773 |
| 28.99936 | 45.69847 | 56.69561 |
| 37.79707 | 55.59934 | 68.29949 |
| 40.19645 | 54.89608 | 66.89986 |
| 33.39822 | 49.20101 | 60.29468 |
| 28.09615 | 34.90128 | 50.69717 |
| 41.29961 | 46.89816 | 56.39914 |
| 33.39822 | 49.20101 | 60.29468 |
| 25.20035 | 33.39822 | 55.49592 |
| 41.09966 | 54.09629 | 68.49944 |
| 35.30117 | 55.89582 | 71.29871 |
| 28.29610 | 49.80085 | 74.69783 |
| 28.59946 | 47.09811 | 52.20023 |
| 35.30117 | 55.89582 | 71.29871 |
| 24.40056 | 37.99702 | 67.69965 |
| 35.30117 | 55.89582 | 71.29871 |
| 39.30013 | 56.09577 | 65.99664 |
| 40.59635 | 59.09499 | 74.49788 |
| 35.30117 | 22.89750 | 71.29871 |
| 24.09719 | 35.10122 | 49.89738 |
Let’s artificially introduce some missingness into the dataset to illustrate the imputation methods. Because the values are deleted uniformly at random, the missingness is Missing Completely at Random (MCAR), a special case of Missing at Random (MAR): whether a value is missing does not depend on the observed or unobserved data (nor on the parameters of the multivariate normal model that we will fit).
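To make the deletion step concrete, here is one way it could be scripted in Python. The column names and the 30% rate come from the text; the five sample rows and the helper name `delete_at_random` are made up for illustration.

```python
import numpy as np
import pandas as pd

def delete_at_random(df, frac=0.3, seed=0):
    """Blank out a random `frac` of the entries in each column (MCAR)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n = len(out)
    for col in out.columns:
        rows = rng.choice(n, size=int(frac * n), replace=False)
        out.loc[out.index[rows], col] = np.nan
    return out

# Five made-up rows standing in for the 23-row table above.
data = pd.DataFrame({
    "Age_3":  [34.4, 28.8, 33.4, 36.3, 29.0],
    "Age_7":  [46.2, 42.8, 49.2, 46.8, 45.7],
    "Age_28": [61.1, 59.8, 60.3, 61.8, 56.7],
})
with_missing = delete_at_random(data, frac=0.3)
```

Because the deleted positions are drawn independently of all data values, this mechanism is MCAR by construction.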
Here is the same table, but after randomly deleting 30% of the values in each column.
| Age_3 | Age_7 | Age_28 |
|---|---|---|
| NA | 46.20179 | NA |
| 28.79941 | 42.79578 | NA |
| NA | 49.20101 | 60.29468 |
| 36.30091 | 46.80163 | 61.79773 |
| 28.99936 | NA | NA |
| 37.79707 | 55.59934 | 68.29949 |
| 40.19645 | 54.89608 | NA |
| 33.39822 | 49.20101 | 60.29468 |
| NA | 34.90128 | 50.69717 |
| 41.29961 | 46.89816 | 56.39914 |
| 33.39822 | NA | 60.29468 |
| 25.20035 | NA | 55.49592 |
| NA | 54.09629 | 68.49944 |
| 35.30117 | 55.89582 | NA |
| NA | 49.80085 | 74.69783 |
| 28.59946 | 47.09811 | 52.20023 |
| 35.30117 | NA | 71.29871 |
| NA | NA | 67.69965 |
| 35.30117 | 55.89582 | 71.29871 |
| 39.30013 | NA | 65.99664 |
| 40.59635 | 59.09499 | 74.49788 |
| 35.30117 | 22.89750 | NA |
| 24.09719 | 35.10122 | 49.89738 |
The goal is to test the following hypothesis:
\[\begin{aligned} H_0 &: \mu_{\text{Day 7}} - \mu_{\text{Day 3}} \leq 10 \\ H_a &: \mu_{\text{Day 7}} - \mu_{\text{Day 3}} > 10 \end{aligned} \]
Here, \(\mu_{\text{Day 7}}\) and \(\mu_{\text{Day 3}}\) are the population means of concrete compressive strength on day 7 and day 3 respectively.
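On complete (or singly imputed) data, this one-sided hypothesis with a 10 MPa margin amounts to a one-sample t-test on the paired differences. A sketch with made-up strength values, not the actual dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (MPa); illustrative only.
day3 = np.array([34.4, 28.8, 33.4, 36.3, 29.0, 37.8, 40.2, 33.4])
day7 = np.array([46.2, 42.8, 49.2, 46.8, 45.7, 55.6, 54.9, 49.2])

# H0: mean(day7 - day3) <= 10  vs  Ha: mean(day7 - day3) > 10.
# Testing the paired differences against popmean=10 is equivalent
# to a paired t-test with a 10 MPa shift under the null.
diff = day7 - day3
res = stats.ttest_1samp(diff, popmean=10.0, alternative="greater")
print(res.statistic, res.pvalue)
```

A small p-value would lead to rejecting \(H_0\) in favor of the mean difference exceeding 10 MPa.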
Discussion
Three different methods will be used to analyze the data:
- Paired t-test after regression imputation
- Bayesian credible intervals after multivariate normal imputation with Jeffreys prior
- Bayesian credible intervals after multivariate normal imputation with Unit Information prior
Since we are using a Bayesian approach, we need to set a prior on the parameters of the multivariate normal model. The Jeffreys prior is a non-informative prior that is invariant under reparameterization. The Unit Information prior is a weakly informative prior that carries roughly the same amount of information as a single observation from the data. Both priors are commonly used when little prior information is available about the parameters of the model2.
2 If the data does not contain sufficient information, the choice of prior (even when it is non-informative or weak) can have a significant impact on the posterior distribution. In such cases, it is important to carefully consider the choice of prior and its implications for the analysis. For the purpose of this blog, we will assume that the data contains sufficient information for the priors to have minimal impact on the posterior distribution.
Regression imputation builds a linear regression model that predicts the missing values in one variable from the other variables; this is done in turn for each variable, iterating until all missing values are filled in. A paired t-test is then performed on the imputed dataset. The multivariate normal imputation used here assumes the data follow a multivariate normal distribution, and Bayesian methods are used to impute the missing values. The two priors (Jeffreys and Unit Information) lead to slightly different results.
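The iterative regression scheme described above can be sketched with plain least squares. This is a minimal illustration of the idea, not necessarily the exact procedure the book uses; the helper name and the toy matrix are made up.

```python
import numpy as np

def regression_impute(Y, n_iter=10):
    """Single imputation: repeatedly regress each variable on the others
    and overwrite its missing entries with the fitted (point) predictions."""
    Y = np.asarray(Y, dtype=float)
    miss = np.isnan(Y)
    filled = Y.copy()
    # Initialize missing entries with column means.
    col_means = np.nanmean(Y, axis=0)
    filled[miss] = np.take(col_means, np.where(miss)[1])
    n, p = Y.shape
    for _ in range(n_iter):
        for j in range(p):
            if not miss[:, j].any():
                continue
            others = [k for k in range(p) if k != j]
            X = np.column_stack([np.ones(n), filled[:, others]])
            # Fit column j on the other columns using observed rows only.
            beta, *_ = np.linalg.lstsq(X[~miss[:, j]], Y[~miss[:, j], j], rcond=None)
            filled[miss[:, j], j] = X[miss[:, j]] @ beta
    return filled

# Toy data with missing entries; values are illustrative, not the dataset.
Y = np.array([
    [34.4, 46.2, 61.1],
    [np.nan, 42.8, 59.8],
    [33.4, np.nan, 60.3],
    [36.3, 46.8, np.nan],
    [29.0, 45.7, 56.7],
    [37.8, 55.6, 68.3],
])
Y_imp = regression_impute(Y)
```

Note that each missing entry receives a single deterministic prediction, which is exactly why this method understates uncertainty.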
The disadvantage of regression imputation is that it does not account for the uncertainty in the imputed values, leading to underestimated standard errors and potentially misleading results. The multivariate normal imputation, on the other hand, accounts for this uncertainty by generating multiple imputations from the posterior distribution of the parameters.
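The Bayesian imputation step can be sketched as a Gibbs sampler that alternates between drawing the missing values and drawing the model parameters. This is a minimal sketch under the Jeffreys prior with made-up data; the book's actual sampler (and the Unit Information variant) may differ in details.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

def gibbs_mvn_impute(Y, n_iter=1000, burn=200):
    """Gibbs sampler for a p-variate normal with missing values under the
    (improper) Jeffreys prior, proportional to |Sigma|^(-(p+1)/2).
    Returns post-burn-in draws of the mean vector mu."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    miss = np.isnan(Y)
    # Initialize missing entries with column means.
    Yc = Y.copy()
    col_means = np.nanmean(Y, axis=0)
    Yc[miss] = np.take(col_means, np.where(miss)[1])
    mu = Yc.mean(axis=0)
    Sigma = np.cov(Yc, rowvar=False)
    mu_draws = []
    for t in range(n_iter):
        # 1) Draw each missing sub-vector from its conditional normal
        #    given the observed entries in that row and (mu, Sigma).
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            Soo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
            Smo = Sigma[np.ix_(m, o)]
            cond_mu = mu[m] + Smo @ Soo_inv @ (Yc[i, o] - mu[o])
            cond_S = Sigma[np.ix_(m, m)] - Smo @ Soo_inv @ Smo.T
            Yc[i, m] = rng.multivariate_normal(cond_mu, cond_S)
        # 2) mu | Sigma, completed data ~ N(ybar, Sigma / n).
        ybar = Yc.mean(axis=0)
        mu = rng.multivariate_normal(ybar, Sigma / n)
        # 3) Sigma | mu, completed data ~ Inverse-Wishart(n, S_mu).
        resid = Yc - mu
        Sigma = invwishart.rvs(df=n, scale=resid.T @ resid, random_state=rng)
        if t >= burn:
            mu_draws.append(mu)
    return np.array(mu_draws)

# Toy data; values are illustrative, not the dataset.
Y = np.array([
    [34.4, 46.2, 61.1],
    [np.nan, 42.8, 59.8],
    [33.4, np.nan, 60.3],
    [36.3, 46.8, np.nan],
    [29.0, 45.7, 56.7],
    [37.8, 55.6, 68.3],
    [40.2, 54.9, 66.9],
])
mu_draws = gibbs_mvn_impute(Y)
# 95% one-sided lower credible bound for mu_Day7 - mu_Day3.
lower = np.quantile(mu_draws[:, 1] - mu_draws[:, 0], 0.05)
```

Because each Gibbs sweep re-draws the missing values rather than fixing them, the spread of `mu_draws` reflects both sampling variability and imputation uncertainty.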
Figure 1 below shows the 95% lower bounds (the upper bound is infinity) for the mean difference in concrete strength between day 7 (i.e. Group 2) and day 3 (i.e. Group 1) under the three methods. The red dashed line marks the 10 MPa threshold. The paired t-test after regression imputation gives a lower bound of 10.24 MPa, which is above the threshold, so the null hypothesis is rejected; we would conclude with 95% confidence that the difference in strength exceeds 10 MPa. However, the Bayesian credible intervals after multivariate normal imputation with the Jeffreys and Unit Information priors give lower bounds of 9.2 MPa and 9.51 MPa respectively, both below the threshold, so we fail to reject the null hypothesis.
It is important to note the assumptions made while using the multivariate normal model for imputation. The data is assumed to be multivariate normally distributed and the missingness is assumed to be Missing at Random (MAR). Although these assumptions may not hold perfectly in practice, the multivariate normal model is often robust to moderate violations of these assumptions. However, large deviations from these assumptions can lead to biased estimates and incorrect inferences.
Also, we have used a Gibbs sampler to generate 10,000 samples from the posterior distribution. The convergence of the sampler should be checked using trace plots and auto-correlation plots to ensure that the samples are representative of the posterior distribution. Only then can we trust the results obtained from the analysis. A complete analysis would include these diagnostic checks, but they have been omitted here for brevity.
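Trace and autocorrelation plots are visual checks, but the same idea can be quantified numerically, for instance with a rough effective sample size (ESS). The truncation rule and helper names below are my own simplification, not a standard library routine.

```python
import numpy as np

def autocorr(chain, max_lag=50):
    """Sample autocorrelations of a 1-D MCMC chain at lags 1..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    var = np.mean(x * x)
    return np.array([np.mean(x[:-k] * x[k:]) / var for k in range(1, max_lag + 1)])

def effective_sample_size(chain, max_lag=50):
    """Crude ESS: n / (1 + 2 * sum of leading positive autocorrelations)."""
    rho = autocorr(chain, max_lag)
    # Truncate at the first non-positive autocorrelation.
    cut = int(np.argmax(rho <= 0)) if (rho <= 0).any() else len(rho)
    return len(chain) / (1.0 + 2.0 * rho[:cut].sum())

# A nearly independent chain should have ESS close to its length;
# strong autocorrelation in a Gibbs chain would drive the ESS down.
rng = np.random.default_rng(1)
iid_chain = rng.normal(size=10_000)
ess = effective_sample_size(iid_chain)
```

A low ESS relative to the 10,000 draws would signal that the sampler is mixing poorly and the credible intervals should not yet be trusted.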
Conclusion
This blog illustrates the potentially different conclusions that could be drawn using single imputation vs multiple imputation. Single imputation methods like mean imputation and regression imputation do not account for uncertainty in the imputed values, leading to underestimated standard errors, artificially narrower intervals and potentially misleading results. Multiple imputation methods like the multivariate normal model (using a Bayesian approach in this case) account for this uncertainty by generating multiple imputations from the posterior distribution of the parameters, leading to more accurate estimates and inferences.
Although we have used multivariate normal imputation here for a paired comparison of means with only 2 groups, it applies equally to datasets with more than 2 variables. The book mentioned earlier contains examples of multivariate normal imputation for such datasets, along with details of the Gibbs sampler used to generate samples from the posterior distribution. The book by Schafer (1997)3 is another excellent resource for learning more about multivariate imputation and other multiple imputation methods; it also discusses imputation when the variables are of mixed types (continuous, categorical, etc.)4.
3 Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Chapman and Hall/CRC. https://doi.org/10.1201/9781420041903.
4 We only discussed imputing continuous data using the multivariate normal model in this blog.
The next chapter in the Bayesian book that I am reading is on group comparisons and hierarchical modeling. I hope to write another blog in the near future discussing these topics as well. Stay tuned!
End
If you have any questions or comments, please feel free to write to me at shishir@calgaryanalyticsltd.com or reach out to me on LinkedIn.