Single Imputation vs Multivariate Normal Imputation

A Demonstration using a Bayesian Approach

Author

Shishir Rao

Published

December 23, 2025

Introduction

Hoff, P. D. (2009). A First Course in Bayesian Statistical Methods. Springer. https://doi.org/10.1007/978-0-387-92407-6

Recently, I have been revisiting my notes from the course on Applied Bayesian Methods that I took in grad school. There is a chapter in the book (see margin) on the multivariate normal model, and one of its sections discusses missing data and how to impute it using that model. An exercise problem at the end of the chapter asks the reader to conduct a paired comparison of the means of two groups that contain missing data. First, the author suggests using regression imputation to fill in the missing values and then comparing the means of the two groups with a paired t-test. Next, he suggests imputing the missing values with a multivariate normal model and comparing the means again. The two methods lead to opposite conclusions: the t-test is significant, but no significant difference is found when the multivariate normal model is used for imputation. I thought this was an excellent example to illustrate the difference between multiple imputation methods, which account for the uncertainty in the missing data, and single imputation methods like regression imputation, which do not.

In this blog post, I pick another dataset, carry out the paired comparison using both regression imputation (followed by a t-test) and multivariate normal imputation, and discuss the results.

Dataset

I have used the Concrete Compressive Strength dataset available on the UC Irvine Machine Learning Repository website. The full dataset can be found here1.

1 Yeh, I. (1998). Concrete Compressive Strength. UCI Machine Learning Repository. https://doi.org/10.24432/C5PK67.

The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. We will compare the compressive strength on day 3 vs day 7. Because concrete gains strength as it cures, the strength on day 7 is almost certain to be higher than on day 3. So, instead of testing for a difference in means, we will test the hypothesis that the mean strength on day 7 exceeds the mean strength on day 3 by a certain margin (say 10 MPa).

A subset of the dataset with only the relevant columns for this analysis has been extracted. The table below shows the concrete strength for three different ages (in days) for each sample - day 3, day 7 and day 28. We will only compare day 3 with day 7.

Table 1. Concrete Compressive Strength (MPa) at Different Ages (Days)
Age_3 Age_7 Age_28
34.39796 46.20179 61.09447
28.79941 42.79578 59.79825
33.39822 49.20101 60.29468
36.30091 46.80163 61.79773
28.99936 45.69847 56.69561
37.79707 55.59934 68.29949
40.19645 54.89608 66.89986
33.39822 49.20101 60.29468
28.09615 34.90128 50.69717
41.29961 46.89816 56.39914
33.39822 49.20101 60.29468
25.20035 33.39822 55.49592
41.09966 54.09629 68.49944
35.30117 55.89582 71.29871
28.29610 49.80085 74.69783
28.59946 47.09811 52.20023
35.30117 55.89582 71.29871
24.40056 37.99702 67.69965
35.30117 55.89582 71.29871
39.30013 56.09577 65.99664
40.59635 59.09499 74.49788
35.30117 22.89750 71.29871
24.09719 35.10122 49.89738

Let’s artificially introduce some missingness in the dataset to illustrate the imputation methods. Since the values are deleted completely at random, the missingness mechanism is Missing Completely at Random (MCAR), which satisfies the weaker Missing at Random (MAR) assumption required by the methods below: whether a value is missing does not depend on the unobserved value itself (or on the parameters of the multivariate normal model that we will fit).
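
To make the setup concrete, here is a minimal sketch of how such missingness could be introduced in Python. Only the first five rows of Table 1 are used for illustration, and the helper function, variable names and random seed are my own, not taken from the original analysis:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility

    # First five rows of Table 1, for illustration only.
    strength = pd.DataFrame({
        "Age_3":  [34.39796, 28.79941, 33.39822, 36.30091, 28.99936],
        "Age_7":  [46.20179, 42.79578, 49.20101, 46.80163, 45.69847],
        "Age_28": [61.09447, 59.79825, 60.29468, 61.79773, 56.69561],
    })

    def delete_at_random(df, frac=0.30):
        """Set a random `frac` of the entries in each column to NaN (MCAR)."""
        out = df.copy()
        n_missing = int(round(frac * len(out)))
        for col in out.columns:
            rows = rng.choice(len(out), size=n_missing, replace=False)
            out.iloc[rows, out.columns.get_loc(col)] = np.nan
        return out

    strength_missing = delete_at_random(strength)
    print(strength_missing)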

Here is the same table, but after randomly deleting 30% of the values in each column.

Table 2. Concrete Compressive Strength (MPa) at Different Ages (Days) with Missing Values
Age_3 Age_7 Age_28
NA 46.20179 NA
28.79941 42.79578 NA
NA 49.20101 60.29468
36.30091 46.80163 61.79773
28.99936 NA NA
37.79707 55.59934 68.29949
40.19645 54.89608 NA
33.39822 49.20101 60.29468
NA 34.90128 50.69717
41.29961 46.89816 56.39914
33.39822 NA 60.29468
25.20035 NA 55.49592
NA 54.09629 68.49944
35.30117 55.89582 NA
NA 49.80085 74.69783
28.59946 47.09811 52.20023
35.30117 NA 71.29871
NA NA 67.69965
35.30117 55.89582 71.29871
39.30013 NA 65.99664
40.59635 59.09499 74.49788
35.30117 22.89750 NA
24.09719 35.10122 49.89738

The goal is to test the following hypothesis:

\[\begin{aligned} H_0 &: \mu_{\text{Day 7}} - \mu_{\text{Day 3}} \leq 10 \text{ MPa} \\ H_a &: \mu_{\text{Day 7}} - \mu_{\text{Day 3}} > 10 \text{ MPa} \end{aligned}\]

\(\mu_{\text{Day 7}}\) and \(\mu_{\text{Day 3}}\) are the population means of concrete compressive strength on day 7 and day 3, respectively.

Discussion

Three different methods will be used to analyze the data:

  1. Paired t-test after regression imputation
  2. Bayesian credible intervals after multivariate normal imputation with Jeffreys prior
  3. Bayesian credible intervals after multivariate normal imputation with Unit Information prior

Since we are using a Bayesian approach, we need to set a prior for the parameters of the multivariate normal model. The Jeffreys prior is a non-informative prior that is invariant under reparameterization. The Unit Information prior is a weakly informative prior that carries roughly the same amount of information as a single observation from the data. Both priors are commonly used when there is limited prior information available about the parameters of the model2.

2 If the data does not contain sufficient information, the choice of prior (even when it is non-informative or weak) can have a significant impact on the posterior distribution. In such cases, it is important to carefully consider the choice of prior and its implications for the analysis. For the purpose of this blog, we will assume that the data contains sufficient information for the priors to have minimal impact on the posterior distribution.
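
For reference, for a \(p\)-dimensional multivariate normal model with mean \(\theta\) and covariance matrix \(\Sigma\), the Jeffreys prior has the form

\[ p_J(\theta, \Sigma) \propto |\Sigma|^{-(p+2)/2}, \]

while the Unit Information prior is constructed, loosely speaking, by centring the prior on the sample mean and sample covariance and giving it the weight of a single observation; Hoff (2009) gives the exact construction.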

Regression imputation involves building a linear regression model that predicts the missing values in one variable from the observed values of the other variables. This is done iteratively, cycling over the variables, until all missing values are filled in, and a paired t-test is then performed on the completed dataset (a sketch of this approach is given below). The multivariate normal imputation used here instead assumes that the data follow a multivariate normal distribution, and Bayesian methods are used to draw the missing values from their posterior predictive distribution. The two priors (Jeffreys and Unit Information) will lead to slightly different results.
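
A minimal sketch of this single-imputation approach, assuming the values from Table 2 are in a pandas DataFrame called strength_missing (as in the earlier sketch); scikit-learn's IterativeImputer is used here as a stand-in for the regression imputation and may differ in detail from the procedure behind the results reported below:

    import pandas as pd
    from scipy import stats
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Regression imputation: each column with missing values is regressed on
    # the other columns, iterating until the imputed values stabilise.
    imputer = IterativeImputer(random_state=0)
    completed = pd.DataFrame(imputer.fit_transform(strength_missing),
                             columns=strength_missing.columns)

    # One-sided paired t-test of H0: mean(Age_7 - Age_3) <= 10 MPa
    diff = completed["Age_7"] - completed["Age_3"]
    t_res = stats.ttest_1samp(diff, popmean=10, alternative="greater")

    # 95% lower confidence bound for the mean difference
    lower = diff.mean() - stats.t.ppf(0.95, df=len(diff) - 1) * diff.sem()
    print(t_res.pvalue, lower)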

The disadvantage of regression imputation is that it does not account for the uncertainty in the imputed values, which leads to underestimated standard errors and potentially misleading results. Multivariate normal imputation, on the other hand, accounts for this uncertainty: the missing values are repeatedly redrawn from their posterior predictive distribution as the model parameters themselves are sampled.

Figure 1 below shows the 95% lower bounds (the upper bound is infinity) for the mean difference between concrete strength on day 7 (i.e., Group 2) and day 3 (i.e., Group 1) using the three methods. The red dashed line indicates the threshold of 10 MPa. The paired t-test after regression imputation gives a lower bound of 10.24 MPa, which is above the threshold, so the null hypothesis is rejected; we would conclude that we are 95% confident that the difference in mean strength exceeds 10 MPa. However, the Bayesian credible intervals after multivariate normal imputation with the Jeffreys prior and the Unit Information prior give lower bounds of 9.2 MPa and 9.51 MPa respectively, both below the threshold, so we fail to reject the null hypothesis.

Figure 1: Mean Difference between Concrete Strength on Day 7 and Day 3 using different methods

It is important to note the assumptions made while using the multivariate normal model for imputation. The data is assumed to be multivariate normally distributed and the missingness is assumed to be Missing at Random (MAR). Although these assumptions may not hold perfectly in practice, the multivariate normal model is often robust to moderate violations of these assumptions. However, large deviations from these assumptions can lead to biased estimates and incorrect inferences.

Also, we have used a Gibbs sampler to generate 10,000 samples from the posterior distribution. The convergence of the sampler should be checked using trace plots and auto-correlation plots to ensure that the samples are representative of the posterior distribution. Only then can we trust the results obtained from the analysis. A complete analysis would include these diagnostic checks, but they have been omitted here for brevity.
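
To make the procedure concrete, here is a minimal sketch, in Python, of the kind of Gibbs sampler described above. It uses a semi-conjugate normal/inverse-Wishart prior with sample-based hyperparameters, which is only one way of encoding a Unit Information prior, and it is not the exact code behind the results in Figure 1:

    import numpy as np
    from scipy import stats

    def gibbs_mvn_impute(Y, n_iter=10_000, seed=0):
        """Gibbs sampler for a multivariate normal model with missing values.

        Y is an (n, p) array with np.nan marking missing entries. Each
        iteration draws Sigma and theta given the currently completed data,
        then redraws the missing entries from their conditional normal
        distribution given the observed entries in the same row.
        """
        rng = np.random.default_rng(seed)
        Y = np.asarray(Y, dtype=float)
        n, p = Y.shape
        miss = np.isnan(Y)

        # Unit-Information-style prior built from the data (an assumption made
        # for this sketch): prior mean = column means of the observed values,
        # prior covariance = covariance of the complete cases, each carrying
        # roughly the weight of one observation.
        mu0 = np.nanmean(Y, axis=0)
        S0 = np.cov(Y[~miss.any(axis=1)], rowvar=False)
        L0_inv = np.linalg.inv(S0)
        nu0 = p + 2

        Y_full = np.where(miss, mu0, Y)        # start from mean imputation
        theta = Y_full.mean(axis=0)
        theta_draws = np.empty((n_iter, p))

        for t in range(n_iter):
            # --- Sigma | theta, Y_full ~ inverse-Wishart
            resid = Y_full - theta
            Sigma = stats.invwishart.rvs(df=nu0 + n,
                                         scale=S0 + resid.T @ resid,
                                         random_state=rng)
            Sigma_inv = np.linalg.inv(Sigma)

            # --- theta | Sigma, Y_full ~ multivariate normal
            ybar = Y_full.mean(axis=0)
            Ln = np.linalg.inv(L0_inv + n * Sigma_inv)
            mun = Ln @ (L0_inv @ mu0 + n * Sigma_inv @ ybar)
            theta = rng.multivariate_normal(mun, Ln)
            theta_draws[t] = theta

            # --- missing entries | theta, Sigma, observed entries (row-wise)
            for i in range(n):
                m, o = miss[i], ~miss[i]
                if not m.any():
                    continue
                S_oo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
                reg = Sigma[np.ix_(m, o)] @ S_oo_inv
                cond_mean = theta[m] + reg @ (Y_full[i, o] - theta[o])
                cond_cov = Sigma[np.ix_(m, m)] - reg @ Sigma[np.ix_(o, m)]
                Y_full[i, m] = rng.multivariate_normal(cond_mean, cond_cov)

        # A real analysis would discard burn-in draws and check trace and
        # autocorrelation plots before using theta_draws.
        return theta_draws

    # Example (assuming the 23 rows of Table 2 are in a DataFrame
    # strength_missing with columns Age_3, Age_7, Age_28, as earlier):
    # draws = gibbs_mvn_impute(strength_missing.to_numpy())
    # diff = draws[:, 1] - draws[:, 0]        # Age_7 minus Age_3
    # print(np.quantile(diff, 0.05))          # 95% lower credible bound

The 5% posterior quantile of the difference corresponds to the kind of one-sided 95% lower credible bound shown in Figure 1.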

Conclusion

This blog post illustrates the different conclusions that can be drawn using single imputation vs multiple imputation. Single imputation methods like mean imputation and regression imputation do not account for uncertainty in the imputed values, leading to underestimated standard errors, artificially narrow intervals and potentially misleading results. Multiple imputation methods like the multivariate normal model (implemented here with a Bayesian approach) account for this uncertainty by generating multiple imputations from the posterior distribution of the parameters, leading to more reliable estimates and inferences.

Although we have used multivariate normal imputation here for a paired comparison of the means of just two groups, the same approach works for datasets with more than two variables. Hoff's book contains examples of multivariate normal imputation on such larger datasets, along with details of the Gibbs sampler used to generate samples from the posterior distribution. The book by Schafer (1997)3 is another excellent resource on multivariate imputation and other multiple imputation methods; it also covers imputation when the variables are of mixed types (continuous, categorical, etc.)4.

3 Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Chapman and Hall/CRC. https://doi.org/10.1201/9781420041903.

4 We only discussed imputing continuous data using the multivariate normal model in this blog.

The next chapter in the Bayesian book that I am reading is on group comparisons and hierarchical modeling. I hope to write another blog post in the near future discussing these topics. Stay tuned!

End

If you have any questions or comments, please feel free to write to me at shishir@calgaryanalyticsltd.com or reach out to me on LinkedIn.