The Question¶
“Pregnancy length partly predicts birth weight. What else does?”
Chapter 10 used one predictor. Real phenomena are shaped by many variables. Multiple regression lets us use all of them simultaneously.
Multiple Regression¶
We extend the simple model to include multiple predictors:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

For birth weight we might include:

$x_1$ = pregnancy length (weeks)
$x_2$ = mother’s age (years)
$x_3$ = birth order (1st, 2nd, ...)

Matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

OLS solution:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
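As a concrete sketch, the OLS solution can be computed directly with NumPy. The numbers below are made up for illustration, not drawn from the survey data:

```python
import numpy as np

# Hypothetical data: 6 births with pregnancy length (weeks),
# mother's age (years), and birth order as predictors.
X_raw = np.array([
    [39, 25, 1],
    [40, 30, 2],
    [38, 22, 1],
    [41, 33, 3],
    [37, 28, 1],
    [40, 35, 2],
], dtype=float)
y = np.array([7.2, 7.8, 6.9, 8.1, 6.5, 7.9])

# Prepend a column of ones so beta_0 acts as the intercept.
X = np.column_stack([np.ones(len(y)), X_raw])

# OLS solution beta_hat = (X^T X)^{-1} X^T y; lstsq is numerically
# safer than forming the inverse explicitly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values and residuals.
y_hat = X @ beta_hat
residuals = y - y_hat
```

In practice statsmodels does this (and the inference) for you; the point here is only that the matrix formula is all there is to the fit.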
Interpreting Multiple Regression Coefficients¶
Each coefficient $\beta_j$ is the change in $y$ per unit change in $x_j$, holding all other variables constant.
This is critical: in simple regression, the age coefficient captures both the direct effect of age AND any confounds correlated with age. In multiple regression, each coefficient is adjusted for the others.
Example: in simple regression, pregnancy length predicts birth weight. But longer pregnancies and heavier babies might both be correlated with mother’s health. Multiple regression can separate these.
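A small simulation makes this concrete. The data-generating process below is invented (an unobserved "health" variable driving both pregnancy length and birth weight); it is not the real survey, but it shows the simple-regression coefficient absorbing the confound while the multiple-regression coefficient recovers the direct effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder: mother's "health" raises both
# pregnancy length and birth weight.
health = rng.normal(size=n)
length = 39 + 1.0 * health + rng.normal(size=n)                    # weeks
weight = 7 + 0.3 * length + 0.5 * health + rng.normal(scale=0.5, size=n)

def ols(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simple regression: the length coefficient also soaks up the
# health effect, so it comes out well above the true 0.3.
b_simple = ols(length[:, None], weight)

# Multiple regression: adjusting for health recovers ~0.3 for length
# and ~0.5 for health.
b_multi = ols(np.column_stack([length, health]), weight)
```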
Nonlinear Relationships in Regression¶
Chapter 7 showed that mother’s age has a nonlinear relationship with birth weight — teens and older mothers have lighter babies, peak in the 30s.
We can model this with a quadratic term:

$$y = \beta_0 + \beta_1\,\text{agepreg} + \beta_2\,\text{agepreg}^2 + \varepsilon$$

The model is still linear in the coefficients — we just created a new variable $x_2 = \text{agepreg}^2$. This is called polynomial regression.
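A minimal sketch of the idea on simulated data (the inverted-U shape with a peak near age 31 is assumed for illustration): the design matrix simply gains an `age**2` column, and the fit is still ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(15, 45, size=5000)

# Hypothetical inverted-U relationship: birth weight peaks near age 31.
weight = 7.5 - 0.01 * (age - 31) ** 2 + rng.normal(scale=0.3, size=age.size)

# Still a linear model in the coefficients: the design matrix just
# gets an extra age^2 column.
X = np.column_stack([np.ones(age.size), age, age ** 2])
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)

# For a quadratic, the fitted peak sits at -beta_1 / (2 * beta_2).
peak_age = -beta[1] / (2 * beta[2])
```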
Logistic Regression — Binary Outcomes¶
What if the outcome is binary — e.g., preterm birth (yes/no)?
We can’t use linear regression for binary outcomes (it predicts probabilities outside [0, 1]). Instead, we use logistic regression, which models the log-odds:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

Solving for $p$:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$

This is the sigmoid function — the same one used in neural networks.

Coefficients: $e^{\beta_j}$ is the odds ratio for variable $x_j$.
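Both facts can be checked numerically. The coefficients below are invented for illustration; the point is that the ratio of odds between two inputs one unit apart equals $e^{\beta_j}$ no matter where you start:

```python
import math

def sigmoid(z):
    # Maps log-odds z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients: intercept and a birth-order slope.
b0, b1 = -2.0, 0.4

# Predicted probability for birth order 1 vs birth order 2.
p1 = sigmoid(b0 + b1 * 1)
p2 = sigmoid(b0 + b1 * 2)

# The odds ratio per unit increase is exp(b1), regardless of baseline:
# (p2 / (1 - p2)) / (p1 / (1 - p1)) == e^{b1}.
odds_ratio = (p2 / (1 - p2)) / (p1 / (1 - p1))
```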
Using statsmodels¶
We use statsmodels (not sklearn) because it gives us:
Full statistical output (p-values, confidence intervals, F-statistics)
The same interface as R — important for academic work
AIC/BIC for model comparison
import statsmodels.formula.api as smf
model = smf.ols('totalwgt_lb ~ prglngth + agepreg + birthord', data=df).fit()
print(model.summary())

Model Comparison¶
How do we know if adding a predictor improves the model?
$R^2$ increases when you add any variable (even useless noise) — use adjusted $R^2$ instead
AIC / BIC penalize complexity — lower is better; use them to compare models
F-test tests whether a group of coefficients is jointly zero
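A sketch of an AIC comparison on simulated data. One common form for Gaussian OLS, up to an additive constant, is $n \log(\mathrm{RSS}/n) + 2p$ (statsmodels reports a fully specified version via `model.aic`); the predictor names here are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
noise_var = rng.normal(size=n)          # pure-noise predictor
y = 1.0 + 2.0 * x + rng.normal(size=n)

def fit_rss(X, y):
    # OLS fit; returns residual sum of squares and parameter count.
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid, X.shape[1]

def aic(rss, n, p):
    # Gaussian-likelihood AIC up to an additive constant.
    return n * np.log(rss / n) + 2 * p

rss_simple, p_simple = fit_rss(x[:, None], y)
rss_full, p_full = fit_rss(np.column_stack([x, noise_var]), y)

# RSS always drops when a predictor is added, but the 2p penalty
# usually makes the noise-augmented model's AIC worse (higher).
aic_simple = aic(rss_simple, n, p_simple)
aic_full = aic(rss_full, n, p_full)
```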
Adjusted $R^2$:

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

where $k$ is the number of predictors and $n$ is the number of observations.
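The formula is easy to compute directly; the numbers below are illustrative, showing that for a fixed $R^2$ the adjusted value drops as predictors are added:

```python
import numpy as np

def r_squared(y, y_hat):
    # Fraction of variance in y explained by the fitted values.
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    # Penalizes R^2 by the number of predictors k.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R^2 = 0.5, same n = 100, different numbers of predictors:
print(adjusted_r_squared(0.5, n=100, k=1))   # ~0.4949
print(adjusted_r_squared(0.5, n=100, k=10))  # ~0.4438
```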
Exercises¶
Fit multiple regression: totalwgt_lb ~ prglngth + agepreg + birthord. Report coefficients.
Add a quadratic age term (agepreg**2). Does adjusted $R^2$ improve?
Fit logistic regression predicting preterm birth (prglngth < 37 weeks) from mother’s age and birth order.
What is the odds ratio for birth order on preterm birth?
Compare AIC of the simple model (prglngth only) vs the full model.
Glossary¶
multiple regression — regression with more than one predictor
coefficient — $\beta_j$: change in outcome per unit change in $x_j$, holding others constant
polynomial regression — adding squared/cubic terms to model nonlinear relationships
logistic regression — regression for binary outcomes; models log-odds
sigmoid function — $\sigma(z) = \frac{1}{1 + e^{-z}}$; maps any value to (0, 1)
odds ratio — $e^{\beta_j}$; multiplicative change in odds per unit increase in $x_j$
adjusted $R^2$ — $R^2$ penalized for number of predictors; prevents overfitting
AIC — Akaike Information Criterion; model quality + complexity penalty; lower = better
BIC — Bayesian Information Criterion; heavier complexity penalty than AIC