Introduction
Regression is one of the most widely used analytical tools in statistics and data science for understanding relationships between variables and explaining real-world phenomena. In health research, it plays a central role for students, junior researchers, and senior investigators alike in identifying factors associated with clinical or public-health outcomes.
In practice, beginners often struggle with regression because it is not always clear what makes a model “good.” Common difficulties include obtaining contradictory results after adding or removing variables, coefficients that change direction or magnitude unexpectedly, wide confidence intervals that limit interpretation, and models that are hard to explain despite appearing statistically sound. These problems usually arise not from incorrect calculations, but from limited understanding of variable selection, parsimony, and the goals of regression modeling.
Despite its importance, regression is frequently misused. A common mistake is the inclusion of too many explanatory variables in a single model, under the assumption that more variables necessarily lead to better results. In practice, overly complex models often perform poorly, are difficult to interpret, and may generalize badly to new data.
A good model is therefore not the one that includes the largest number of variables, but the one that explains reality accurately with the least necessary complexity. This principle is known as parsimony.
1. What is regression?
Regression is a statistical method used to study the relationship between a dependent variable (the outcome) and one or more independent (explanatory) variables, while accounting for the influence of other factors, including potential confounders. In practical terms, regression allows a more accurate estimation of the association between an explanatory variable (X) and an outcome (Y), by adjusting for other variables that may influence this relationship.
For example, comparing the performance of two surgeons solely based on crude surgical success rates may lead to misleading conclusions. Such a comparison ignores important factors such as case complexity, surgical technique, and years of experience. Regression makes it possible to adjust for these variables, allowing a fairer and more informative comparison between surgeons.
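The surgeon comparison can be made concrete with a small sketch. The counts below are entirely hypothetical and chosen only to show how a crude comparison can mislead once case complexity is taken into account; stratification is used here as a minimal stand-in for the adjustment a regression model performs:

```python
# Hypothetical counts: (successes, operations) per surgeon and case complexity.
cases = {
    "Surgeon A": {"easy": (81, 90), "hard": (4, 10)},
    "Surgeon B": {"easy": (9, 10), "hard": (45, 90)},
}

def crude_rate(strata):
    """Overall success rate, ignoring case complexity."""
    wins = sum(s for s, _ in strata.values())
    total = sum(n for _, n in strata.values())
    return wins / total

def stratum_rates(strata):
    """Success rate within each complexity stratum."""
    return {k: s / n for k, (s, n) in strata.items()}

for surgeon, strata in cases.items():
    print(surgeon, round(crude_rate(strata), 2), stratum_rates(strata))
```

Crudely, Surgeon A appears far better (85% versus 54%), yet within each stratum Surgeon B does at least as well; A simply operated on easier cases. An adjusted regression comparison would reveal the same reversal.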
Depending on the nature of the dependent variable, several types of regression are commonly used:
- Simple and multiple linear regression
- Binary, ordinal, and multinomial logistic regression
- Poisson regression
- Cox proportional hazards regression
2. Explanatory versus Predictive Regression: Distinct Goals
Before evaluating model quality, it is essential to distinguish between two broad purposes of regression modeling: explanatory and predictive.
Explanatory models aim to understand relationships between variables. Their primary goals are interpretation, estimation of associations, and insight into underlying mechanisms. In this context, emphasis is placed on coefficient stability, plausibility, and clarity of interpretation. Parsimony is critical because overly complex models obscure understanding and weaken inference.
Predictive models, by contrast, aim to accurately predict outcomes for new observations. Their success is judged mainly by predictive performance, often evaluated using validation datasets and prediction error metrics. Such models may include many variables and interactions if they improve predictive accuracy, even at the expense of interpretability.
This blog focuses primarily on explanatory regression modeling, where the goal is not prediction, but understanding and communicating meaningful associations. The principles discussed below are therefore framed around interpretability, stability, and parsimony rather than predictive optimization.
3. Evaluation of the Quality of a Regression Model
The quality of a regression model is primarily assessed by its ability to explain or predict the outcome while maintaining interpretability and stability. Several performance indicators exist, among which the most commonly used are the coefficients of determination: R² and adjusted R².
- R² (coefficient of determination) represents the proportion of variability in the dependent variable (Y) explained by the model. Its value ranges from 0 to 1, with higher values indicating better explanatory power. However, R² never decreases when new variables are added, even if those variables contribute little meaningful information.
- Adjusted R² addresses this limitation by accounting for the number of explanatory variables included in the model. It penalizes unnecessary predictors and therefore allows a fairer comparison between models with different levels of complexity.
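Both statistics are easy to compute by hand. The sketch below uses pure Python and made-up numbers: `y` is the observed outcome and `y_hat` the fitted values from some hypothetical two-predictor model.

```python
def r_squared(y, y_hat):
    """R² = 1 - SS_res / SS_tot: share of variability in y explained by the fit."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R² penalizes the p predictors used to obtain the fit."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y     = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]   # observed outcome (hypothetical)
y_hat = [3.2, 4.8, 7.1, 8.9, 11.3, 12.7]   # fitted values from a 2-predictor model
print(r_squared(y, y_hat), adjusted_r_squared(y, y_hat, p=2))
```

If a useless third predictor were added, R² could only stay the same or rise slightly, while the adjusted version would fall: exactly the penalty described above.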
The objective is not to maximize the number of variables, but to identify the optimal combination of explanatory variables (predictors) (Xi) that explain the outcome (Y) adequately without unnecessary complexity. Such a model is referred to as parsimonious: simple, interpretable, and robust, with a reduced risk of overfitting.
What is overfitting in explanatory models? In the context of explanatory regression, overfitting refers to a situation in which a model includes too many predictors relative to the available information, leading to unstable estimates, inflated standard errors, and coefficients that are highly sensitive to small changes in the data. Rather than clarifying the relationships between predictors and the outcome, an overfitted explanatory model obscures interpretation and weakens inferential validity, making it difficult to draw reliable conclusions about associations or underlying mechanisms.
4. How to obtain a parsimonious model?
Before discussing specific techniques, it is useful to keep a simple workflow in mind. In most applied analyses, building a parsimonious regression model follows four broad steps: (1) prepare the data and ensure an adequate number of events, (2) identify plausible variables based on theory and prior evidence, (3) reduce redundancy and overfitting through selection or penalization, and (4) evaluate model performance and interpretability.
4.1. Data preparation and the events‑per‑variable principle
Before performing any regression analysis, careful data preparation is important. This includes checking for sparse categories, zero counts, and implausible values that may compromise the stability of parameter estimates.
An important consideration is the ratio between the number of observed events (for example, cases with a positive outcome) and the number of explanatory variables included in the model. The classic rule of thumb requires a minimum of ten events per variable, but Vittinghoff et al. (2007) showed that relaxing this to five events per variable can still yield acceptable models, and van Smeden et al. (2016) argue that no single fixed ratio guarantees validity. The example below applies the relaxed threshold of five events per variable.
Example: If a study includes 20 observed events, no more than four explanatory variables should be entered into the model (20 ÷ 5 = 4). Exceeding this threshold increases the risk of unstable estimates and poor generalizability.
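As a pre-modeling sanity check, the rule reduces to one line of integer arithmetic. The helper below is a hypothetical convenience function; the default of ten events per variable is the classic rule, and five is the relaxed value used in the example above.

```python
def max_predictors(n_events, events_per_variable=10):
    """Upper bound on model size implied by the events-per-variable rule of thumb."""
    if events_per_variable <= 0:
        raise ValueError("events_per_variable must be positive")
    return n_events // events_per_variable

print(max_predictors(20, events_per_variable=5))  # 4, as in the example above
```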
4.2. Variable selection strategies
Once initial data checks are complete, variable selection becomes the next critical step. Three main approaches are commonly used.
- Manual selection
Manual selection relies on subject‑matter knowledge, clinical reasoning, and evidence from the literature. Variables are chosen based on prior evidence of association with the outcome rather than purely statistical considerations.
For example, when studying factors associated with loss to follow‑up among patients receiving antiretroviral therapy, variables such as distance to the health facility, income level, or perceived stigma are more plausible candidates than biologically unrelated variables like blood type.
An initial bivariate screening is often conducted using appropriate statistical tests, such as:
- The Chi-square test – used to assess the association between categorical dependent and independent variables.
- Spearman, Pearson, or Kendall correlation tests – used to assess the relationship between quantitative or ordinal dependent and independent variables, depending on data distribution and measurement scale.
Because this step is exploratory, a lenient significance threshold is typically used, such as p < 0.10, or the even more relaxed cutoffs of p < 0.20 or p < 0.25 from the purposeful selection strategy described by Hosmer and colleagues in Applied Logistic Regression. They caution against the premature exclusion of potentially important variables, since some variables that are not statistically significant in univariate analysis may become significant after adjustment for confounding in the multivariable model. Concretely, they suggest initially including variables with a p-value < 0.25 in univariate analysis, and then retaining in the multivariable model those that remain statistically significant at the conventional threshold of p < 0.05.
Table 1: An example of candidate variable selection.
| Variable | Total (n) | Lost to follow-up n (%) | Retained in care n (%) | Crude OR (95% CI) | p-value |
|---|---|---|---|---|---|
| **Sex** | | | | | |
| Male | 120 | 35 (29.2) | 85 (70.8) | 1.45 (0.85–2.48) | 0.17 |
| Female | 180 | 40 (22.2) | 140 (77.8) | 1.00 | — |
| **Age group** | | | | | |
| < 30 years | 90 | 30 (33.3) | 60 (66.7) | 1.83 (1.06–3.17) | 0.03* |
| ≥ 30 years | 210 | 45 (21.4) | 165 (78.6) | 1.00 | — |
| **Education level** | | | | | |
| Secondary or less | 150 | 50 (33.3) | 100 (66.7) | 2.50 (1.45–4.30) | 0.001* |
| Higher education | 150 | 25 (16.7) | 125 (83.3) | 1.00 | — |

\* Candidate variables selected at the 10% (p < 0.10) threshold.
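The bivariate screening behind Table 1 can be reproduced without any statistical package. For a 2×2 table the chi-square statistic has one degree of freedom, and its p-value is available through the complementary error function in Python's standard library. The sketch below checks the Sex row using the counts from Table 1:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for the 2x2 table
    [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = [a, b, c, d]
    expected = [rows[0] * cols[0] / n, rows[0] * cols[1] / n,
                rows[1] * cols[0] / n, rows[1] * cols[1] / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 df, P(X² >= chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Sex row of Table 1: 35 lost / 85 retained among men, 40 / 140 among women.
chi2, p = chi2_2x2(35, 85, 40, 140)
print(round(chi2, 2), round(p, 2))  # p ≈ 0.17: kept as a candidate at p < 0.25
```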
Following this step, collinearity between explanatory variables must be assessed. Strong correlations indicate redundant information and may destabilize the model.
For example, if weight and BMI are included in the same model, their strong correlation can lead to multicollinearity. In this case, BMI might be preferred, as it already incorporates weight in its calculation.
Common tools include Pearson, Spearman, or Kendall correlation coefficients, as well as the Variance Inflation Factor (VIF). A VIF value above 10 generally suggests problematic multicollinearity. When strong collinearity is detected, one variable should be retained based on clinical or contextual relevance, or a composite variable may be constructed.
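When a model contains exactly two predictors, the VIF has the closed form VIF = 1 / (1 − r²), where r is their Pearson correlation, so the weight/BMI situation can be checked directly. The patient values below are hypothetical:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x, y):
    """VIF shared by two predictors when both enter the same model."""
    r = pearson_r(x, y)
    return 1 / (1 - r ** 2)

weight = [58, 64, 70, 76, 82, 95]              # kg (hypothetical patients)
bmi    = [21.0, 22.8, 24.9, 26.6, 29.1, 33.4]  # strongly tracks weight

print(round(vif_two_predictors(weight, bmi), 1))  # far above 10: redundant pair
```

With more than two predictors, each VIF is computed by regressing one predictor on all the others, but the interpretation of the 10 threshold is the same.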

- Automatic selection
Automatic selection relies on algorithmic procedures that evaluate the contribution of each variable to overall model fit, most often through the likelihood function or related information criteria.
- Forward selection starts with an empty model and adds variables one at a time, selecting at each step the variable whose inclusion leads to the greatest improvement in model fit.
- Backward selection begins with a full model that includes all candidate variables and then removes variables sequentially, eliminating those whose exclusion has the smallest impact on model fit.
- Stepwise selection combines both approaches, allowing variables to be added or removed iteratively based on predefined criteria.
For example, consider a logistic regression model aimed at identifying factors associated with loss to follow-up among patients on antiretroviral therapy. Suppose the initial set of candidate variables includes age, sex, education level, distance to the health facility, employment status, and perceived stigma.
- In a forward selection approach, the model may first include distance to the health facility because it provides the largest improvement in likelihood. In the next step, employment status may be added, followed by perceived stigma, until no additional variable meaningfully improves model fit. The final model may therefore include only three predictors, even though six were initially considered.
- In contrast, a backward selection approach would start with all six variables in the model and progressively remove those that contribute the least. For example, sex and age may be dropped early if their removal does not substantially worsen model fit, leading again to a more compact model.
These methods are widely implemented in statistical software packages such as SPSS and can be useful for exploratory analyses. However, because they rely heavily on statistical criteria and may ignore clinical or contextual relevance, their results should be interpreted cautiously and ideally complemented by subject-matter knowledge.
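The greedy logic of forward selection is easy to see outside a menu-driven package. The sketch below is a simplified stand-in for what such procedures do: it fits ordinary least squares and uses R² improvement as the selection criterion (real implementations typically use likelihood-based criteria), stopping when no remaining variable improves the fit by more than a small threshold. All data are made up.

```python
def fit_r2(columns, y):
    """R² of an OLS fit of y on the given predictor columns (plus intercept),
    via the normal equations solved with Gaussian elimination."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for i in range(k):  # elimination with partial pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p], b[i], b[p] = A[p], A[i], b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * c for a, c in zip(A[r], A[i])]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back-substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    y_hat = [sum(bi * xi for bi, xi in zip(beta, row)) for row in X]
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - ss_res / ss_tot

def forward_select(candidates, y, min_gain=0.01):
    """Add the variable giving the biggest R² gain until no gain exceeds min_gain."""
    selected, best_r2 = [], 0.0
    while True:
        gains = {name: fit_r2([candidates[n] for n in selected + [name]], y) - best_r2
                 for name in candidates if name not in selected}
        if not gains or max(gains.values()) <= min_gain:
            return selected
        winner = max(gains, key=gains.get)
        selected.append(winner)
        best_r2 += gains[winner]

# Hypothetical data: y depends on x1 only; x2 is noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0, 8.0, 7.0]
y  = [2 * v + 3 for v in x1]
print(forward_select({"x1": x1, "x2": x2}, y))  # ['x1']
```

Backward selection is the mirror image: start from the full set and repeatedly drop the variable whose removal costs the least fit.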
Figure 2 shows the SPSS dialog for binary logistic regression. After loading the dataset, the interface is accessed via Analyze → Regression → Binary Logistic Regression. The outcome is specified in the Dependent field and candidate predictors in Covariates. The Method menu defines the variable selection strategy. By default, Enter includes all selected variables simultaneously (no selection). Forward and backward options implement automatic selection based on likelihood ratio, Wald, or conditional criteria, corresponding to the procedures described in the text.

- Penalized regression methods
Advanced techniques such as Lasso, Ridge, and Elastic Net regression are designed to control model complexity by adding a penalty to the regression coefficients. In simple terms, these methods discourage the model from assigning large effects to many variables at the same time.
Why is this useful? In explanatory analyses, including too many variables or highly correlated predictors can destabilize coefficient estimates, inflate uncertainty, and complicate interpretation. Penalization helps limit unnecessary complexity, producing more stable and interpretable estimates, even when predictors are correlated.
- Ridge regression shrinks the coefficients of correlated variables toward zero, reducing instability without removing variables entirely.
- Lasso regression goes a step further by shrinking some coefficients exactly to zero, effectively performing automatic variable selection.
- Elastic Net combines both approaches, making it useful when many predictors are correlated.
These methods are used mainly in predictive modeling, but they can also be informative in exploratory explanatory analyses when the number of candidate predictors is large. Because penalization deliberately shrinks coefficient estimates, however, the resulting coefficients should be interpreted cautiously, especially when the primary goal is causal explanation.
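The shrinkage mechanism is easiest to see in the one-predictor case, where the ridge estimate has the closed form slope = Σxy / (Σx² + λ) after centering. The toy data below are hypothetical:

```python
def ridge_slope(x, y, lam):
    """Ridge estimate of the slope for a single predictor after centering:
    slope = sum(x*y) / (sum(x^2) + lam). lam = 0 gives ordinary least squares."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    sxy = sum(a * b for a, b in zip(xc, yc))
    sxx = sum(a * a for a in xc)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]  # roughly y = 2x

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_slope(x, y, lam), 3))  # slope shrinks as lam grows
```

As the penalty λ grows, the slope is pulled steadily toward zero. Lasso uses an absolute-value penalty instead, which can push a slope exactly to zero, and Elastic Net mixes the two.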
5. Common Pitfalls to Avoid
When building regression models, several recurrent mistakes should be avoided:
- Including variables solely because they are statistically significant in bivariate analysis
- Ignoring collinearity between predictors
- Relying on automatic selection methods without theoretical justification
- Interpreting coefficients from heavily penalized models as causal effects
Conclusion
A high‑quality regression model is not one that fits the data perfectly, but one that explains reality accurately using the fewest necessary variables. Parsimony is therefore central to building models that are interpretable, stable, and generalizable.
When in doubt, a simpler model that can be clearly explained and defended is often preferable to a complex model whose assumptions and results are difficult to justify.
References
- Chesneau, C. (2015, November 7). Modèles de régression [Regression models]. Université de Caen Basse-Normandie. Accessed 26 July 2025 at http://www.math.unicaen.fr/~chesneau/
- Wikistat. Sélection de modèle en régression linéaire [Model selection in linear regression]. Accessed 26 July 2025 at https://www.math.univ-toulouse.fr/~besse/Wikistat/pdf/st-m-app-linSelect.pdf
- Vittinghoff E, McCulloch CE. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol. 2007 Mar 15;165(6):710-8. doi: 10.1093/aje/kwk052. Epub 2006 Dec 20. PMID: 17182981
- van Smeden M, de Groot JA, Moons KG, Collins GS, Altman DG, Eijkemans MJ, Reitsma JB. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Med Res Methodol. 2016 Nov 24;16(1):163. doi: 10.1186/s12874-016-0267-3. PMID: 27881078; PMCID: PMC5122171.
- Legrand P, Bories D. Le choix des variables explicatives dans les modèles de régression logistique [Choosing explanatory variables in logistic regression models]. Paper presented at the AIMS 2007 conference, May 2007. Available at: https://www.researchgate.net/publication/281834969
- Bursac, Z., Gauss, C. H., Williams, D. K., & Hosmer, D. W. (2008). Purposeful selection of variables in logistic regression. BMC Medical Research Methodology, 8, 17. https://doi.org/10.1186/1471-2288-8-17
- Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression. 3rd ed. New York: Wiley; 2013.
