A problem frequently encountered by researchers, and unfortunately not well documented, is how to handle missing data within a dataset to ensure sound analysis. Even after a rigorous process of field data collection, the quality of the dataset is not always guaranteed, due to factors that may be beyond the researcher's control (faulty equipment, insufficient training of interviewers, etc.). These issues often lead to inconsistencies and incompleteness in the database.
1. Types of Missing Data
1.1 Missing Completely at Random (MCAR):
Data are said to be missing completely at random when the absence of a value depends neither on observed nor unobserved variables. In other words, the reasons for which certain data are missing are entirely independent of the characteristics of the individuals or the parameters being studied, and result purely from chance. When this condition is met, analyses performed on the remaining data are unbiased. However, this scenario is rarely encountered in practice. In the example (see Figure 1), we observe that missing values of Systolic Blood Pressure (SBP) are unrelated to any other characteristic, as both men and women, young and old alike, are affected.
1.2 Missing at Random (MAR):
This occurs when missing observations are not completely random but can be fully explained by variables for which complete information exists. For example, unlike MCAR, we may notice that missing SBP values tend to occur among individuals under 30 years of age.
1.3 Missing Not at Random (MNAR):
Also known as non-ignorable or non-response data, MNAR occurs when the value of the missing variable is related to the reason why it is missing. For instance, individuals without recorded SBP values might be those who arrived with high blood pressure in emergency situations, where the measurement could not be taken immediately.
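The three mechanisms above can be illustrated on a small simulated dataset. The sketch below is a toy example (the columns `age` and `sbp`, the sample size, and all probabilities are illustrative assumptions, not from the source): the same SBP column is made missing under MCAR (pure chance), MAR (missingness driven by the observed age), and MNAR (missingness driven by the unobserved high SBP itself).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "sbp": rng.normal(120, 15, n).round(1),  # Systolic Blood Pressure
})

# MCAR: every SBP value has the same 10% chance of being missing,
# independent of any characteristic of the individual.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "sbp"] = np.nan

# MAR: SBP is far more likely to be missing for individuals under 30;
# the missingness depends only on the fully observed variable "age".
mar = df.copy()
p = np.where(mar["age"] < 30, 0.40, 0.05)
mar.loc[rng.random(n) < p, "sbp"] = np.nan

# MNAR: high SBP values themselves are more likely to be missing
# (e.g. emergencies where the measurement could not be taken).
mnar = df.copy()
p = np.where(mnar["sbp"] > 140, 0.50, 0.05)
mnar.loc[rng.random(n) < p, "sbp"] = np.nan
```

Note that from the incomplete data alone, MAR and MNAR can look identical; distinguishing them requires reasoning about *why* values are missing, which is exactly what the examples in the text do.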

2. Approaches for Handling Missing Data
To address the problem of missing values, two families of approaches can be used:
- Deletion Methods
- Imputation Methods
2.1 Deletion Methods (Complete Case Analysis)
This method, also called Complete Case Analysis, is one of the simplest. It involves identifying and removing observations (rows) that contain missing values (Figure 2). However, it is recommended only when the proportion of missing data is very small (less than 5% of the total population), to avoid biasing the dataset. For example, in a dataset of 100 individuals describing their sociodemographic profiles, if the ages of 25 participants (25% of the total) are missing, deleting these cases would lead to a significant loss of information. In such cases, imputation methods are preferred.
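In pandas, complete case analysis is a one-liner; the sketch below (a minimal illustration with made-up values) also shows the preliminary check recommended above: compute the fraction of incomplete rows first, and only delete when the loss is small.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, np.nan, 45, 52, 29, np.nan, 61],
    "sex": ["F", "M", "F", None, "M", "F", "M", "F"],
})

# Proportion of rows with at least one missing value.
frac_incomplete = df.isna().any(axis=1).mean()

# Apply complete case analysis only if the loss is small
# (using the < 5% rule of thumb); otherwise prefer imputation.
if frac_incomplete < 0.05:
    complete = df.dropna()
else:
    complete = df  # too much would be lost: impute instead
```

Here 3 of 8 rows (37.5%) are incomplete, so deletion would discard far too much information, exactly as in the 25%-missing-ages example above.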

2.2 Simple Imputation Methods
Imputation methods involve replacing missing values with the mean or median of the series, or with values generated through more sophisticated techniques such as extrapolation or iterative Principal Component Analysis (PCA).
If the series is normally distributed, missing values are replaced by the mean; otherwise, by the median. For qualitative variables, the mode (most frequent category) is used. In our previous example, the distribution of the 75 observed ages is examined. If normally distributed, the 25 missing ages are replaced by the mean; if not, by the median. Although commonly used, this method tends to reduce variability by clustering values around the mean or median, potentially introducing bias. In Figure 3, missing “sex” values are replaced by the most frequent category (female), and quantitative variables like age and SBP are replaced by their respective means (38 and 113).
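The decision rule described above (mean if normal, median otherwise, mode for qualitative variables) can be sketched as follows. This is an illustrative implementation, not the source's procedure verbatim: it uses the Shapiro-Wilk test from SciPy as the normality check, and the small dataset is invented.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 33, 35, 40, np.nan, np.nan],
    "sex": ["F", "F", "M", "F", None, "M", "F", "F", "M", "F"],
})

# Quantitative variable: test the observed values for normality
# (Shapiro-Wilk); impute the mean if normal, the median otherwise.
observed = df["age"].dropna()
_, p_value = stats.shapiro(observed)
fill = observed.mean() if p_value > 0.05 else observed.median()
df["age"] = df["age"].fillna(fill)

# Qualitative variable: impute the mode (most frequent category).
df["sex"] = df["sex"].fillna(df["sex"].mode()[0])
```

The drawback noted in the text is visible here: every imputed age receives the same value, shrinking the variance of the series.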

2.3 Multiple Imputation
This is currently the most widely used and recommended method in modern statistical analysis. It replaces missing values several times with plausible values generated from a probabilistic model. Each completed dataset is analyzed separately, and the results are combined to produce a more robust final estimate (Figure 3). This approach preserves the natural variability of the data and minimizes bias. Unlike simple imputation, it does not rely on a single fixed estimate but accounts for uncertainty in the estimation process.
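The workflow above (impute m times, analyze each completed dataset, pool) can be sketched with scikit-learn's `IterativeImputer`. This is an assumption-laden illustration, not the source's procedure: the source does not name a library, the data are simulated, and only the pooled point estimate (the average of the per-dataset estimates, the simplest of Rubin's rules) is computed.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal([40, 120], [10, 15], size=(100, 2))  # columns: age, SBP
X[rng.random(100) < 0.2, 1] = np.nan                # ~20% of SBP missing

# Draw m completed datasets; sample_posterior=True adds random draws so
# each imputation is a plausible value rather than a fixed estimate.
m = 5
estimates = []
for seed in range(m):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imp.fit_transform(X)
    estimates.append(completed[:, 1].mean())  # analyze each dataset

# Pool the m analyses: the pooled point estimate is their average.
pooled_mean_sbp = np.mean(estimates)
```

Because each of the m datasets contains different plausible draws, the spread of the m estimates reflects the uncertainty due to the missing values, which a single imputation would hide.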

Figure 3. Overview of the multiple imputation process, from incomplete data to pooled estimates using Rubin’s rules.
2.4 Other imputation techniques
2.4.1 Interpolation and Extrapolation Methods
These are deterministic approaches that predict missing values based on relationships or correlations between variables. For example, if age correlates with height, taller individuals might be assigned higher ages, while shorter ones lower ages. The most common form is linear extrapolation. The main limitation is that with many variables this process becomes tedious, as it requires pairwise analysis; this is where iterative PCA becomes useful.
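The age-from-height example above amounts to fitting a straight line on the complete pairs and using it to predict the missing values. A minimal sketch, with invented data and NumPy's `polyfit` as the line-fitting tool:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 155, 160, 165, 170, 175, 180, 185],
    "age":       [20,  24,  np.nan, 31, 35, np.nan, 42, 46],
})

# Fit a line age = a * height + b on the complete pairs, then use it
# to predict (linearly extrapolate) the missing ages.
obs = df.dropna()
a, b = np.polyfit(obs["height_cm"], obs["age"], deg=1)
missing = df["age"].isna()
df.loc[missing, "age"] = a * df.loc[missing, "height_cm"] + b
```

Being deterministic, the method always assigns the same predicted value to the same height, so it shares simple imputation's tendency to understate variability.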
2.4.2 Iterative Principal Component Analysis (PCA)
PCA is an exploratory data analysis technique that helps visualize relationships among several quantitative variables across different dimensions. Iterative PCA estimates missing values by repeating the PCA process several times, each time using the results of the previous iteration to refine the imputations, until the imputed values stabilize and a satisfactory completed dataset is obtained.
This process is available in several statistical software tools. Because the algorithm can iterate indefinitely, researchers must specify a maximum number of iterations.
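The iterate-until-stable idea, with an explicit maximum number of iterations, can be sketched from scratch with NumPy. This is a simplified illustration of the principle (low-rank PCA reconstruction via SVD), not the exact algorithm of any particular software package:

```python
import numpy as np

def iterative_pca_impute(X, n_components=1, max_iter=100, tol=1e-6):
    """Impute missing values by iterating a low-rank PCA reconstruction."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means, then refine.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):  # bounded number of iterations
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        low_rank = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        new = np.where(missing, low_rank, X)  # refine only the missing cells
        if np.max(np.abs(new - filled)) < tol:
            break  # imputed values have stabilized
        filled = new
    return filled
```

Each pass re-estimates the principal components from the current completed matrix and re-predicts the missing cells from them; `max_iter` plays the role of the iteration cap that researchers must specify in statistical software.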
3. Practical Example with SPSS
Once the dataset is imported into SPSS and the variables with completeness issues are identified, go to the Transform menu and select Replace Missing Values.

Choose the appropriate imputation method; in this case, we will choose mean imputation.

Select the variable to be processed and assign a name to the new variable.

The result is the following: all missing values have been replaced by the series mean, and this has been done in a new variable named exper_1.
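For readers working outside SPSS, the same operation can be mirrored in pandas: keep the original variable and write the mean-completed series into a new column. The variable names follow the SPSS example above; the values are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"exper": [2.0, 5.0, np.nan, 8.0, np.nan, 3.0]})

# Mirror SPSS "replace with series mean into a new variable":
# "exper" is left untouched and "exper_1" holds the completed series.
df["exper_1"] = df["exper"].fillna(df["exper"].mean())
```

Keeping the original column intact, as SPSS does, makes it easy to compare the distributions before and after imputation.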

Conclusion
Proper management of missing values is essential to ensure the validity and reliability of statistical analyses. A rigorous methodological approach that incorporates appropriate imputation techniques while accounting for data characteristics enhances research quality and increases confidence in conclusions. Although the choice of method depends on the nature of the missing data, Complete Case Analysis and Multiple Imputation remain the most widely used.
