The Question
“Can we describe this entire distribution with just 2 numbers?”
So far we have described distributions empirically — we plot the actual data. But empirical descriptions have a problem: they are specific to this sample. If we collect new data, the histogram shifts slightly.
A parametric model says: “this data comes from a known family of distributions, characterized by a small number of parameters.” If the model fits well, we can:
Describe the distribution in 2–3 numbers instead of thousands
Generate synthetic data
Compute probabilities analytically
Compare datasets on the same scale
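As a minimal sketch of these four uses, the following fits a normal model with Python's standard-library `statistics.NormalDist`. The `weights` list is made-up illustrative data (not NSFG values), and the choice of the normal family is an assumption for the example.

```python
from statistics import NormalDist

# Hypothetical sample (illustrative values, not real NSFG data).
weights = [6.8, 7.1, 7.4, 7.9, 6.5, 8.2, 7.0, 7.6, 7.3, 6.9]

# 1. Describe: the whole sample is summarized by two numbers.
model = NormalDist.from_samples(weights)
print(f"mu = {model.mean:.2f}, sigma = {model.stdev:.2f}")

# 2. Generate synthetic data from the fitted model.
synthetic = model.samples(5, seed=1)
print(synthetic)

# 3. Compute a probability analytically: P(X > 8.0) under the model.
p_above_8 = 1 - model.cdf(8.0)
print(f"P(X > 8.0) = {p_above_8:.3f}")

# 4. Compare on a common scale: z-score of one observation.
z = model.zscore(7.9)
print(f"z-score of 7.9 = {z:.2f}")
```

With only ten points the fit is rough; the same calls apply unchanged to a sample of thousands.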
The Exponential Distribution
When it appears: waiting times between events, inter-arrival times, survival times.
Parameters: one, the rate $\lambda$ (events per unit time), or equivalently the mean $1/\lambda$.
Key property: memoryless. The probability of waiting another $t$ minutes is the same regardless of how long you’ve already waited. Like a coin flip, the past is irrelevant.
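The memoryless property can be checked numerically from the exponential survival function $P(T > t) = e^{-\lambda t}$. The sketch below uses an arbitrary illustrative rate and waiting time; the function names are hypothetical.

```python
import math

def survival(t, lam):
    """P(T > t) for an exponential distribution with rate lam."""
    return math.exp(-lam * t)

def conditional_survival(s, t, lam):
    """P(T > s + t | T > s): chance of waiting t more after already waiting s."""
    return survival(s + t, lam) / survival(s, lam)

lam = 0.5  # hypothetical rate: 0.5 events per minute
t = 3.0    # additional waiting time in minutes

# The conditional probability matches P(T > t) no matter how long we have waited.
for s in (0.0, 1.0, 10.0):
    print(f"waited {s:>4}: {conditional_survival(s, t, lam):.6f} "
          f"vs unconditional {survival(t, lam):.6f}")
```

The ratio $e^{-\lambda(s+t)} / e^{-\lambda s}$ simplifies to $e^{-\lambda t}$, which is why the elapsed time $s$ drops out.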
In NSFG: the time between pregnancies follows an approximately exponential distribution.
Detecting Exponential Shape
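If $CDF(x) = 1 - e^{-\lambda x}$, then on a log-y axis, the complementary CDF becomes a straight line with slope $-\lambda$:
$$\log(1 - CDF(x)) = -\lambda x$$
If the complementary CDF of the data is approximately linear on a log-y scale, the data is consistent with an exponential distribution.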
The Normal Distribution
The most famous distribution in statistics. Also called the Gaussian distribution.
Parameters: mean $\mu$ and standard deviation $\sigma$.
Characterized by: bell shape, symmetric around the mean, 68-95-99.7 rule:
68% of data within $1\sigma$ of the mean
95% within $2\sigma$
99.7% within $3\sigma$
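These percentages follow directly from the normal CDF. A short check using the standard library's `statistics.NormalDist` (the helper name `fraction_within` is hypothetical):

```python
from statistics import NormalDist

def fraction_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal value lies within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

The result does not depend on the particular mean and standard deviation, which is why the rule applies to any normal distribution.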
In NSFG: birth weight is approximately normal. Adult heights are approximately normal. Many measurement errors are normal (by the Central Limit Theorem, Chapter 14).
Normal Probability Plot
To check if data is normally distributed, plot the data against what you would expect if it were perfectly normal:
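Sort your data: $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$
Compute the expected normal quantiles: $z_i = \Phi^{-1}\left(\frac{i - 0.5}{n}\right)$ (one common choice of plotting position), where $\Phi^{-1}$ is the inverse normal CDF
Plot $x_{(i)}$ vs $z_i$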
If the data is normal, the plot is a straight line. Curves indicate skew or heavy tails.
The Lognormal Distribution
If $\log X$ is normally distributed, then $X$ is lognormal.
When it appears: anything that is the product of many independent factors. Income, city populations, file sizes, biological growth rates.
In NSFG: birth weight is roughly normal, but adult body weight is more lognormal (long right tail — a few very heavy people, no symmetric left tail).
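A short simulation illustrates the definition and the right skew. The parameters are hypothetical, and the sample is generated rather than taken from NSFG.

```python
import math
from statistics import NormalDist, mean, median

# Hypothetical parameters of the normal distribution that log(X) follows.
log_model = NormalDist(mu=0.0, sigma=0.5)

# If log(X) is normal, then X = exp(normal draw) is lognormal.
sample = [math.exp(v) for v in log_model.samples(10_000, seed=7)]

# Taking logs of the lognormal sample should recover the normal parameters.
fitted = NormalDist.from_samples([math.log(x) for x in sample])
print(f"fitted mu = {fitted.mean:.3f}, fitted sigma = {fitted.stdev:.3f}")

# Right skew: the long right tail pulls the mean above the median.
print(f"mean = {mean(sample):.3f}, median = {median(sample):.3f}")
```

In practice this suggests a simple test for lognormality: take the logarithm of the data and apply any normality check, such as a normal probability plot, to the result.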
The Pareto Distribution
Named after economist Vilfredo Pareto. Describes the “80/20 rule” phenomenon.
Parameters: minimum value $x_m$ and shape $\alpha$.
When it appears: wealth, city sizes, earthquake magnitudes, word frequencies.
Key property: heavy right tail. Extreme values are much more common than a normal distribution predicts. The mean may be infinite (when $\alpha \le 1$).
Detecting Pareto shape: on a log-log plot, the complementary CDF becomes linear, with slope $-\alpha$.
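Both properties can be read off the Pareto complementary CDF, $P(X > x) = (x / x_m)^{-\alpha}$ for $x \ge x_m$. The sketch below uses hypothetical parameter values and helper names.

```python
import math

def pareto_ccdf(x, x_min, alpha):
    """P(X > x) for a Pareto distribution with minimum x_min and shape alpha."""
    return 1.0 if x < x_min else (x / x_min) ** (-alpha)

def pareto_mean(x_min, alpha):
    """Mean of a Pareto distribution; infinite when alpha <= 1."""
    return math.inf if alpha <= 1 else alpha * x_min / (alpha - 1)

x_min, alpha = 1.0, 1.5  # hypothetical parameters

# On log-log axes the CCDF is a straight line, so the slope between
# any two points equals -alpha.
x1, x2 = 10.0, 1000.0
slope = (
    math.log(pareto_ccdf(x2, x_min, alpha)) - math.log(pareto_ccdf(x1, x_min, alpha))
) / (math.log(x2) - math.log(x1))
print(f"log-log slope = {slope:.3f}")

print(f"mean with alpha = 1.5: {pareto_mean(x_min, 1.5)}")
print(f"mean with alpha = 1.0: {pareto_mean(x_min, 1.0)}")
```

With real data the same idea applies to the empirical complementary CDF: a roughly straight tail on log-log axes is the signature to look for.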
Why Model?
Three reasons:
Compression: a fitted normal replaces thousands of data points with two numbers, $\mu$ and $\sigma$
Interpolation: estimate probabilities between observed values
Communication: “the data is approximately normal with mean $\mu$ and standard deviation $\sigma$” is instantly understood
But models can be wrong. Always check the fit visually. A model that fits poorly is worse than no model — it gives false confidence.
Exercises
Fit a normal distribution to birth weight. What are $\mu$ and $\sigma$?
Make a normal probability plot for birth weight. Does the fit look good?
Plot the complementary CDF of inter-pregnancy intervals on a log scale. Is it exponential?
Fit a lognormal distribution to income (if available) or birth weight. Which fits better?
What fraction of babies weigh more than a chosen threshold $x$ under the normal model? Verify against the data.
Glossary
parametric model — a family of distributions described by a fixed number of parameters
exponential distribution — models waiting times; memoryless; characterized by rate
normal distribution — bell-shaped, symmetric; characterized by mean and std
lognormal distribution — distribution whose logarithm is normal; right-skewed
Pareto distribution — heavy-tailed distribution; models 80/20 phenomena
normal probability plot — scatter plot of data quantiles vs normal quantiles; linear if normal
goodness of fit — how well a model matches the observed data
68-95-99.7 rule — for a normal distribution, 68/95/99.7% of data falls within 1/2/3 std devs