The Question
“We’ve been simulating everything. When can we use formulas instead?”
All previous chapters built understanding through simulation — we shuffled, resampled, and generated. But many textbooks give you formulas directly.
When are those formulas valid? And when does simulation win?
Normal Distributions: The Special Case
Many analytic results only hold when data is normally distributed.
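If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ independently, then:
$$X + Y \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$$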
Sums of normals are normal. This is special — it doesn’t hold for most distributions.
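A quick numerical check of the sum rule, as a minimal sketch: the parameter values below are hypothetical and chosen only for illustration. If the rule holds, the means should add and the variances (not the standard deviations) should add.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical parameters, chosen only for illustration.
mu1, sigma1 = 10.0, 3.0
mu2, sigma2 = -4.0, 4.0

# Draw two independent normal samples and add them element-wise.
x = rng.normal(mu1, sigma1, size=200_000)
y = rng.normal(mu2, sigma2, size=200_000)
total = x + y

# Analytic prediction: means add, variances add.
pred_mean = mu1 + mu2
pred_std = np.sqrt(sigma1**2 + sigma2**2)

print(f"mean: simulated {total.mean():.3f}, predicted {pred_mean:.3f}")
print(f"std:  simulated {total.std(ddof=1):.3f}, predicted {pred_std:.3f}")
```

The simulated and predicted values should agree to within sampling noise. Matching the first two moments does not by itself prove normality; a histogram or normal probability plot of `total` would be the natural next check.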
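Sampling Distributions: The Key Results
If we draw samples of size $n$ from a population with mean $\mu$ and std $\sigma$:
Mean: $\bar{X} \sim N\left(\mu,\ \sigma^2 / n\right)$
Difference in means (two independent samples of sizes $n_1$ and $n_2$): $\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\ \sigma_1^2 / n_1 + \sigma_2^2 / n_2\right)$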
This is how the classical two-sample t-test is derived — it assumes both populations are normal and uses the above formula for the sampling distribution.
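The standard-error formulas can be checked the same way the earlier chapters checked everything else, by simulation. The sketch below assumes two normal populations with made-up parameters (none of these numbers come from the NSFG data) and compares the simulated spread of the sample mean, and of the difference in sample means, with $\sigma/\sqrt{n}$ and $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n_reps = 20_000  # number of simulated samples per group

# Hypothetical population parameters and sample sizes.
mu1, sigma1, n1 = 7.0, 1.5, 25
mu2, sigma2, n2 = 6.5, 2.0, 40

# Each row is one sample; the row mean is one draw from the sampling distribution.
means1 = rng.normal(mu1, sigma1, size=(n_reps, n1)).mean(axis=1)
means2 = rng.normal(mu2, sigma2, size=(n_reps, n2)).mean(axis=1)

# Standard error of a single sample mean.
se_mean_analytic = sigma1 / np.sqrt(n1)
se_mean_simulated = means1.std(ddof=1)

# Standard error of the difference in means (variances add).
se_diff_analytic = np.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
se_diff_simulated = (means1 - means2).std(ddof=1)

print(f"SE(mean): analytic {se_mean_analytic:.4f}, simulated {se_mean_simulated:.4f}")
print(f"SE(diff): analytic {se_diff_analytic:.4f}, simulated {se_diff_simulated:.4f}")
```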
The Central Limit Theorem (CLT)
The most important theorem in statistics.
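Statement: For any population with finite mean $\mu$ and variance $\sigma^2$, as $n \to \infty$:
$$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$$
In other words, the sample mean is approximately $N(\mu, \sigma^2 / n)$, regardless of the shape of the population distribution, as long as $n$ is large enough.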
“Large enough” depends on skewness:
Symmetric population → a small $n$ is usually fine
Moderately skewed → a larger $n$ is needed
Heavily skewed → a much larger $n$ is needed
Why does this matter? It is the reason normal-based formulas work in so many situations even when the data is not normal — we’re taking means of large samples, and means are approximately normal.
Testing the CLT
With NSFG birth weight (slightly left-skewed):
Population is NOT normal
Draw samples of size $n$, compute sample mean
As $n$ increases, the distribution of sample means should approach normal
We can verify this by simulation: plot the sampling distribution of the mean for several increasing values of $n$ and watch it converge to normal.
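A minimal sketch of that experiment follows. Two assumptions to flag: it uses a synthetic exponential population (strongly right-skewed) as a stand-in rather than the NSFG birth weights, and the sample sizes in `sizes` are illustrative choices. Instead of plotting, it summarizes each sampling distribution by its skewness, which should shrink toward 0 as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    z = (values - values.mean()) / values.std()
    return (z**3).mean()

def sample_means(n, n_reps=10_000):
    """Draw n_reps samples of size n from a skewed population; return their means."""
    return rng.exponential(scale=1.0, size=(n_reps, n)).mean(axis=1)

sizes = [4, 16, 64, 256]  # illustrative sample sizes
skews = [skewness(sample_means(n)) for n in sizes]

for n, s in zip(sizes, skews):
    print(f"n = {n:3d}: skewness of sample means = {s:.3f}")
```

To see the convergence visually, pass each `sample_means(n)` array to a histogram and overlay a normal curve with mean 1 and standard deviation $1/\sqrt{n}$. Swapping in the real birth-weight column only requires replacing the `rng.exponential` call with resampling from that column.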
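Applying the CLT: The t-test
For the first-baby pregnancy length question, instead of a permutation test (Chapter 9), we could use a two-sample t-test:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$$
where $\bar{x}_i$, $s_i^2$, and $n_i$ are the sample mean, sample variance, and size of group $i$. Under $H_0$, this follows a t-distribution with approximately:
$$\nu \approx \frac{\left(s_1^2 / n_1 + s_2^2 / n_2\right)^2}{\dfrac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}$$
degrees of freedom (Welch’s approximation, no assumption of equal variance).
Does it give the same answer as the permutation test? Almost always yes, when $n$ is large. This validates both approaches.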
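The comparison can be sketched as follows. The two groups here are synthetic normal samples with hypothetical means, spreads, and sizes, standing in for the first-baby and other pregnancy lengths; they are not the NSFG data. The helper `welch_t` is an illustrative name. The code computes Welch's statistic by hand, cross-checks it against `scipy.stats.ttest_ind` with `equal_var=False`, and runs a small permutation test on the absolute difference in means.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)

# Synthetic stand-ins for the two groups (NOT the NSFG data).
first = rng.normal(38.6, 2.8, size=400)
other = rng.normal(38.5, 2.6, size=450)

def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    va = a.var(ddof=1) / len(a)  # s1^2 / n1
    vb = b.var(ddof=1) / len(b)  # s2^2 / n2
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(first, other)
p_analytic = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value

# Cross-check against SciPy's implementation of Welch's test.
res = stats.ttest_ind(first, other, equal_var=False)

# Permutation test on the absolute difference in means.
pooled = np.concatenate([first, other])
observed = abs(first.mean() - other.mean())
n_iter = 2000
count = 0
for _ in range(n_iter):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:len(first)].mean() - shuffled[len(first):].mean())
    count += diff >= observed
p_perm = count / n_iter

print(f"t = {t_stat:.3f}, df = {df:.1f}")
print(f"p (analytic) = {p_analytic:.3f}, p (SciPy) = {res.pvalue:.3f}, p (permutation) = {p_perm:.3f}")
```

With samples this large, the analytic and permutation p-values should differ only by the Monte Carlo noise in the permutation estimate.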
Correlation Test
For testing whether Pearson’s $r$ is significantly different from zero:
$$t = r \sqrt{\frac{n - 2}{1 - r^2}}$$
Under $H_0$:
$$t \sim t_{n-2}$$
We can compute this analytically instead of running a permutation test, and for large $n$, the answers match.
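As a sketch of the analytic route, the code below transforms $r$ into a t-statistic with $n - 2$ degrees of freedom and compares the resulting p-value with the one reported by `scipy.stats.pearsonr`. The data are synthetic: `x` and `y` are hypothetical variables with a weak built-in linear relationship, not the NSFG age and birth-weight columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)

# Synthetic data with a weak linear relationship (hypothetical values).
n = 200
x = rng.normal(25, 5, size=n)
y = 7.0 + 0.02 * x + rng.normal(0, 1.2, size=n)

# Pearson's r, then the t transform with n - 2 degrees of freedom.
r = np.corrcoef(x, y)[0, 1]
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_analytic = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided

# SciPy reports the same two-sided p-value.
r_scipy, p_scipy = stats.pearsonr(x, y)

print(f"r = {r:.3f}, t = {t_stat:.3f}")
print(f"p (by hand) = {p_analytic:.4f}, p (SciPy) = {p_scipy:.4f}")
```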
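Chi-Squared Test: Analytic Form
In Chapter 9 we introduced chi-squared for categorical data. The analytic version:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-1}$$
approximately, under $H_0$, where $O_i$ and $E_i$ are the observed and expected counts in category $i$ and $k$ is the number of categories.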
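A small worked example, assuming a goodness-of-fit setting with $k = 6$ equally likely categories (a die). The observed counts are hypothetical. The statistic is computed by hand, the p-value comes from the chi-squared distribution with $k - 1$ degrees of freedom, and `scipy.stats.chisquare` serves as a cross-check.

```python
import numpy as np
from scipy import stats

# Hypothetical counts from 60 rolls of a six-sided die.
observed = np.array([8, 9, 19, 5, 8, 11])
expected = np.full(6, observed.sum() / 6)  # a fair die: 10 per face

# Chi-squared statistic: sum of (O - E)^2 / E over categories.
chi2 = ((observed - expected) ** 2 / expected).sum()

# Analytic p-value from the chi-squared distribution with k - 1 df.
k = len(observed)
p_analytic = stats.chi2.sf(chi2, df=k - 1)

# Cross-check against SciPy's goodness-of-fit test.
res = stats.chisquare(observed, expected)

print(f"chi2 = {chi2:.2f}, df = {k - 1}")
print(f"p (by hand) = {p_analytic:.4f}, p (SciPy) = {res.pvalue:.4f}")
```

The chi-squared approximation is usually considered reliable only when the expected counts are not too small; with sparse categories, the simulated version from Chapter 9 is the safer choice.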
When Simulation Beats Formulas
Use simulation (permutation test, bootstrap) when:
The data is heavily skewed or has outliers (CLT hasn’t kicked in yet)
The test statistic is not the mean (e.g., median, max, ratio)
You have small samples
The analytic formula requires assumptions you can’t verify
Use analytic methods when:
$n$ is large and data is not extremely skewed
You need speed (a simulation with 1,000 iterations recomputes the statistic 1,000 times, so it does roughly 1,000× the work)
You want closed-form confidence intervals
You need to communicate to an audience that expects p-values and t-statistics
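The trade-off can be made concrete with confidence intervals. In the sketch below, the data are a synthetic right-skewed (lognormal) sample, not NSFG data, and the resample count of 5,000 is an arbitrary choice. For the mean, the t-based interval and the bootstrap percentile interval should nearly coincide. For the median, the bootstrap handles it with the same few lines, while no comparably simple formula is available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Synthetic right-skewed sample (hypothetical data).
data = rng.lognormal(mean=0.0, sigma=0.5, size=500)
n = len(data)

# Analytic 95% CI for the mean, based on the t-distribution.
se = data.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_analytic = (data.mean() - t_crit * se, data.mean() + t_crit * se)

# Bootstrap: resample rows of indices with replacement, one row per resample.
idx = rng.integers(0, n, size=(5000, n))
resamples = data[idx]

# Percentile intervals for the mean and for the median.
ci_boot_mean = np.percentile(resamples.mean(axis=1), [2.5, 97.5])
ci_boot_median = np.percentile(np.median(resamples, axis=1), [2.5, 97.5])

print(f"mean, analytic CI:    ({ci_analytic[0]:.3f}, {ci_analytic[1]:.3f})")
print(f"mean, bootstrap CI:   ({ci_boot_mean[0]:.3f}, {ci_boot_mean[1]:.3f})")
print(f"median, bootstrap CI: ({ci_boot_median[0]:.3f}, {ci_boot_median[1]:.3f})")
```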
Exercises
Simulate the CLT: draw samples of several sizes $n$ from birth weight. Plot sampling distributions of the mean.
Run a two-sample t-test for pregnancy length (first vs other). Compare to the permutation test p-value.
Compute the analytic 95% confidence interval for mean birth weight. Compare to the bootstrap CI.
Test the correlation between age and birth weight analytically. Same result as the permutation test?
For which NSFG variable does the sampling distribution of the mean converge slowest to normal? Why?
Glossary
Central Limit Theorem (CLT): the sample mean is approximately normal for large $n$, regardless of population shape (given finite variance)
t-distribution: the sampling distribution of the standardized sample mean for normal data when $\sigma$ is unknown and estimated from the sample; approaches the standard normal as $n \to \infty$
t-test: hypothesis test based on the t-statistic; assumes (approximately) normal sampling distribution
Welch’s approximation: formula for degrees of freedom in two-sample t-test without assuming equal variances
chi-squared distribution: distribution of the $\chi^2$ statistic under $H_0$ for categorical data
analytic method: a formula-based result derived from probability theory (vs simulation-based)
normal approximation: using a normal distribution to approximate the sampling distribution when the CLT applies
convergence in distribution: a sequence of distributions approaching a limit distribution