The Limits of PMFs¶
PMFs work well for discrete data with few unique values (like pregnancy length in weeks). But for continuous data — birth weight in pounds — there can be hundreds of distinct values, and plotting every one gives a noisy, unreadable picture.
We need a representation that:
Works for any data (discrete or continuous)
Makes it easy to compare distributions visually
Supports questions like “what percentile is 8 lbs?”
The Cumulative Distribution Function (CDF) does all three.
What is a CDF?¶
The CDF answers one question: what fraction of the data is at or below a given value?
For empirical data, this is simply:
The CDF is always:
Non-decreasing (once a value is “counted”, it stays counted)
Between 0 and 1
Starts at 0, ends at 1
Building a CDF from Scratch¶
def make_cdf(values):
sorted_values = sorted(v for v in values if not np.isnan(v))
n = len(sorted_values)
cumulative_prob = [(v, (i + 1) / n) for i, v in enumerate(sorted_values)]
return cumulative_probEvery value in the dataset gets a rank. The CDF at value is that rank divided by .
Percentiles¶
A percentile is the inverse CDF: given a probability , find the value such that .
50th percentile = median — half the data is below this value
25th percentile = first quartile (Q1)
75th percentile = third quartile (Q3)
IQR = Q3 − Q1 = the middle 50% of the data
For birth weight: the median is around 7.4 lbs. The 25th percentile is ~6.6 lbs, the 75th is ~8.1 lbs. So the typical newborn weighs between 6.6 and 8.1 lbs.
Comparing CDFs¶
The real power of the CDF: two CDFs on the same plot tell the whole story.
For pregnancy length:
If the first-baby CDF is consistently to the right of the other-baby CDF, first babies are uniformly longer
If the CDFs cross, the distributions differ in shape, not just location
When you plot the two CDFs for pregnancy length, they are nearly identical. The 13-hour difference is truly tiny — the CDFs almost overlap completely.
Percentile-Based Statistics¶
The median is more robust than the mean: one outlier doesn’t shift it much.
For skewed data (like income, or birth weight which has a long left tail for premature babies), the median is usually a better summary than the mean.
| Statistic | Formula | Sensitive to outliers? |
|---|---|---|
| Mean | Yes | |
| Median | value at 50th percentile | No |
| IQR | Q3 − Q1 | No |
| Std dev | Yes |
Random Numbers from a CDF¶
A useful trick: if is a uniform random number between 0 and 1, then has the same distribution as the original data.
This lets us simulate new data that matches an observed distribution:
# Generate 1000 synthetic birth weights matching NSFG
u = np.random.uniform(0, 1, size=1000)
synthetic = [cdf_inverse(birth_weight_cdf, ui) for ui in u]This is the foundation of the bootstrap (Chapter 8).
Comparing Percentile Ranks¶
If you know a baby’s birth weight, you might want to know: what percentile is this?
# What fraction of babies weigh less than 8.0 lbs?
rank = cdf_value(birth_weight_cdf, 8.0)
print(f"A baby weighing 8.0 lbs is at the {rank*100:.1f}th percentile")This is exactly what doctors use for growth charts.
Exercises¶
Build a CDF for birth weight (
totalwgt_lb) for all live births.What is the median birth weight? What is the IQR?
Plot CDFs for first vs other babies on the same axes. What do you notice?
Write a function
percentile_rank(cdf, value)that returns the percentile of a given value.Write a function
percentile(cdf, p)that returns the value at a given percentile.Use the inverse CDF to generate 500 synthetic birth weights. Plot them vs the original CDF.
Glossary¶
CDF (Cumulative Distribution Function) — ; fraction of data at or below
percentile — the value below which a given percentage of observations fall
median — the 50th percentile; robust to outliers
quartile — 25th, 50th, or 75th percentile
IQR (Interquartile Range) — Q3 − Q1; the spread of the middle 50% of the data
inverse CDF — given a probability , returns the value such that
robust statistic — a statistic that is not strongly affected by outliers