
0 — From Measurements to Meaning

DhvaniAI

1. The Clean World

Forget models for a moment. You are an engineer with a sensor.

Formally: we have an input $x$ (what we control, or what identifies the source) and a measurement $y$ we read off the sensor. There is some relationship

$$y = f(x)$$

where $f$ is whatever the underlying physics dictates — linear, exponential, wavy, arbitrary.

Two examples to fix the idea:

| | $x$ (source) | $f$ | $y$ (measurement) |
| --- | --- | --- | --- |
| Thermometer | True room temperature | Linear scaling (sensor response) | Reading on the display |
| Camera pixel | Light hitting the surface (scene radiance) | Lens + sensor transfer function | Gray-level pixel value |

In both cases, a perfect sensor would give you $f(x)$ exactly — no guesswork, no error. The thermometer would read the true temperature; the pixel would perfectly encode the true brightness.

In a perfect world, machine learning wouldn’t exist. If you knew $f$, you’d just plug in $x$ and read off $y$. No learning, no estimation, no uncertainty. The entire ML and signal-processing industry hinges on everything that comes next.

```python
import numpy as np
import matplotlib.pyplot as plt

x_grid = np.linspace(0, 10, 200)
true_f = lambda x: 2.0 * x + 0.5      # perfect linear sensor
y_clean = true_f(x_grid)

fig, ax = plt.subplots(figsize=(8, 4.5))
ax.plot(x_grid, y_clean, linewidth=2.5, label='y = f(x) (the truth)')
ax.set_xlabel('x (input / knob setting)')
ax.set_ylabel('y (sensor reading)')
ax.set_title('The clean world — perfect sensor, no ML needed')
ax.legend()
```
The clean world: a perfect sensor reading out the exact value of f(x).

2. Modelling the World

Look back at what we just did. We pointed at “a thermometer in a room” and wrote $y = f(x)$. The room is not an equation. The thermometer is not an equation. We replaced a physical situation with a symbolic stand-in — a few letters and an equals sign — and then promised ourselves we’d do all our reasoning inside the symbols.

That swap has a name. It is called mathematical modelling, and it is the move that makes engineering and science possible. You cannot compute on a room. You can compute on $y = f(x)$. Every prediction ever made — by a calibration curve, by a Kalman filter, by GPT-5 — is a calculation performed inside a model and then projected back onto the world.

A worked example: a falling ball

Drop a ball from a height $h_0$. We want to know how high it is at time $t$. From classical mechanics, with gravity $g \approx 9.81 \, \text{m/s}^2$:

$$h(t) = h_0 - \tfrac{1}{2} g t^2$$

That single line is a model. It compresses the entire physical situation into:

```python
import numpy as np
import matplotlib.pyplot as plt

g = 9.81                                  # m/s^2
h0 = 10.0                                 # release height in metres
t = np.linspace(0, np.sqrt(2 * h0 / g), 200)
h = h0 - 0.5 * g * t**2

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(t, h, linewidth=2.5)
ax.set_xlabel('time t (seconds)')
ax.set_ylabel('height h (metres)')
ax.set_title('A model of a falling ball: h(t) = h₀ − ½ g t²')
ax.grid(True, alpha=0.3)
```

What does the model give you? A prediction of the ball’s height at any time $t$ — and, by setting $h(t) = 0$, the time of impact, $t = \sqrt{2 h_0 / g}$, which is exactly the upper limit used for `t` in the code above.

What this model throws away

Every model is a deliberate lie. This one is missing air resistance, the ball’s spin and shape, wind, and the slight variation of $g$ with height — among other things.

These omissions are not bugs. They are the model’s whole point. Including everything would give you back the world, which is precisely what you were trying to escape. The skill is throwing away the things that don’t matter for the question you’re asking.

> “All models are wrong; some are useful.” — George Box

A model isn’t judged by whether it is true. It is judged by whether predictions made through it survive contact with new data.

Why this matters for the rest of the book

Everything in computer vision and machine learning is a model of something:

You will spend the rest of the book learning which models work for which questions, and what each one quietly throws away. The clean world of §1 is the easy case where the model holds perfectly. The next section is what happens when you actually run an experiment and the model and the world stop agreeing.


3. Noise Enters

Real sensors don’t give you $f(x)$. They give you $f(x) + \text{garbage}$. The garbage has physical origins: thermal agitation in the electronics, quantization in the analog-to-digital converter, mechanical vibration, electromagnetic pickup.

You cannot predict the garbage on any particular reading. What you can predict is its statistics — its distribution, its mean (often 0), its variance.

Running the same measurement many times

Hold $x$ fixed. Read the sensor 1000 times. In the clean world the readings would all be identical. In reality they scatter:

```python
x_fixed    = 5.0
true_value = true_f(x_fixed)   # = 10.5
noise_std  = 0.8
num_reads  = 1000

# Each reading = true value + independent Gaussian noise
readings = true_value + np.random.randn(num_reads) * noise_std

print(f"True value : {true_value}")
print(f"Sample mean: {readings.mean():.4f}")   # close to true_value
print(f"Sample std : {readings.std():.4f}")    # close to noise_std

fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))

# Left: scatter of readings over time
axes[0].scatter(range(num_reads), readings, s=8, alpha=0.6)
axes[0].axhline(true_value, linestyle='--', label=f'true value = {true_value}')
axes[0].set_xlabel('reading number')
axes[0].set_ylabel('sensor output y')
axes[0].legend()

# Right: histogram — noise has structure (bell shape)
axes[1].hist(readings, bins=40, alpha=0.7, edgecolor='white')
axes[1].axvline(true_value, linestyle='--', label='true value')
axes[1].axvline(readings.mean(), label=f'sample mean = {readings.mean():.3f}')
axes[1].set_xlabel('sensor output y')
axes[1].set_ylabel('count')
axes[1].legend()
```
1000 repeated readings at a fixed x. Left: the sequence of values. Right: their distribution.

Two observations that matter for everything that follows:

  1. Single readings are almost useless. No individual reading equals the truth. The best we can say is that any one reading probably lies within the typical noise spread around the truth.

  2. The readings aren’t random in a lawless way. They cluster — the distribution has a shape, a mean, a spread, symmetry. Noise is unpredictable at the level of one sample but predictable at the level of many.

That second point is the foundation of everything. We can’t beat noise on a single reading, but we can characterize it well enough to design algorithms that work on average. The mathematical name for this is statistics.

Running example — MVTec AD, Tile category

We will use one concrete dataset throughout this book to keep the abstractions grounded.

The MVTec Anomaly Detection (MVTec AD) dataset is a public industrial surface inspection benchmark from MVTec GmbH. We use the Tile category — grayscale images of ceramic tile surfaces captured under controlled overhead lighting by a monochrome camera. Some images contain defects (cracks, glue strips, discolorations, rough patches); most do not. The task: decide whether a surface patch is defective.

| Abstract | Concrete (MVTec Tile) |
| --- | --- |
| Source $x$ | a surface patch at a fixed location |
| True value $f(x)$ | true surface reflectance at that patch |
| Measurement $y$ | gray-level pixel value recorded by the camera |
| Noise $\epsilon$ | sensor thermal noise, shot noise, stray light |

This one dataset will be attacked three ways across the book:

  • Attack 1 — average and filter images to suppress noise and reveal defect structure

  • Attack 2 — fit a parametric texture model; flag patches that deviate from the fitted surface

  • Attack 3 — train a CNN on labeled defect/no-defect patches

With a perfect sensor, the pixel values would encode true reflectance exactly and defects would be trivially visible. In practice, noise and texture variation make this hard — and that difficulty is exactly what drives everything that follows.

Left: a defect-free tile. Right: a tile with a crack defect. MVTec AD dataset — Tile category (MVTec GmbH).
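To see why noise is the whole difficulty, here is a toy stand-in for a tile patch — purely synthetic numbers, not MVTec data: a uniform reflectance of 0.5 with a small defect dip, read once by a perfect sensor and once by a noisy one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a tile patch (not real MVTec data):
# uniform reflectance 0.5 with a small square defect in the centre
patch = np.full((32, 32), 0.5)
patch[14:18, 14:18] = 0.42            # defect: reflectance dip of 0.08

clean_reading = patch                                       # perfect sensor: y = f(x)
noisy_reading = patch + rng.normal(0, 0.05, patch.shape)    # real sensor: y = f(x) + eps

defect_depth = 0.50 - 0.42
print(f"defect depth             : {defect_depth:.2f}")
print(f"noise std                : 0.05")
print(f"per-pixel contrast (SNR) : {defect_depth / 0.05:.1f}")
```

With no noise the defect is a clean 0.08 step in the readings; at a per-pixel SNR below 2 it is buried in the scatter — recovering it is exactly what the three attacks below are for.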

4. Why Is the Noise Gaussian?

Look at the histogram in §3. That bell shape isn’t coincidence. In physics and engineering, measurement noise is overwhelmingly Gaussian, and there is a deep reason: the Central Limit Theorem (CLT).

The CLT says: if you add up many independent small random contributions, each from some distribution (any distribution, as long as each has a finite variance), the sum tends to be Gaussian-distributed — regardless of the individual distributions.

Your sensor’s noise is the sum of many tiny independent contributions: thermal, quantization, vibration, EM. By the CLT, their aggregate is approximately Gaussian. This is physics, not a mathematical convenience.

```python
num_draws = 50_000
K_values  = [1, 3, 10, 30]   # number of uniforms to sum

fig, axes = plt.subplots(1, 4, figsize=(15, 3.5))

for ax, K in zip(axes, K_values):
    # Sum K independent uniform(-0.5, 0.5) samples
    samples = np.random.uniform(-0.5, 0.5, size=(num_draws, K))
    sums    = samples.sum(axis=1)

    ax.hist(sums, bins=60, density=True, alpha=0.7, edgecolor='white')

    # Overlay matched Gaussian: variance of uniform(-0.5,0.5) = 1/12
    sigma   = np.sqrt(K / 12.0)
    x_plot  = np.linspace(sums.min(), sums.max(), 200)
    gauss   = np.exp(-x_plot**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    ax.plot(x_plot, gauss, linewidth=2, label=f'Gaussian σ={sigma:.2f}')

    ax.set_title(f'sum of {K} uniform(s)')
    ax.set_xlabel('value')
    ax.set_ylabel('density')
    ax.legend(fontsize=8)
```
CLT in action: summing more and more uniform (non-Gaussian) samples produces a bell curve.

Part I detour: the full probability toolkit that makes this statement precise — Bernoulli → Binomial → Poisson → Normal → CLT — is built in Part I of this book. If your probability is rusty or if you want the derivations, read that first. If you trust the intuition above for now, continue.

The practical payoff arrives in Part I — Least Squares, where we build on this intuition to derive least-squares fitting as maximum likelihood under Gaussian noise — a rigorous justification for why so many algorithms in CV and ML use squared-error losses.


5. The Inverse Problem — Three Attacks

§2 introduced modelling and §3–§4 explained where noise comes from and why it tends to be Gaussian. Putting the two together gives the measurement model that underlies most of this book:

$$y = f(x) + \epsilon$$

— a clean signal $f(x)$ plus a random perturbation $\epsilon$. The three “attacks” below are not three different ways of looking at the world. They are three different ways of building $f$ — three choices for how much structure you commit to before the data arrives.

Now we can state the general problem clearly.

Given noisy measurements $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i$ is random with approximately known statistics, the goal is to recover something useful about $f$ — specific values, the full function, or predictions at new inputs.

Three attacks exist. Each makes a different assumption about how much you already know about $f$ before you start.

Attack 1 — Averaging and signal processing

Premise: I can repeat the measurement at the same $x$ as many times as I want.

Take $N$ readings at a single $x$. Their average $\bar{y}$ has expected value $f(x)$ (the noise averages out) and standard deviation $\sigma / \sqrt{N}$. Double the readings → noise drops by $\sqrt{2}$. This is the famous $\sqrt{N}$ rule — it is the whole reason that scientific instruments have “integration time” knobs.
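The $\sqrt{N}$ rule can be checked in a few lines — a minimal simulation (the `true_value` and `noise_std` numbers are just the ones from the repeated-readings example in §3):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.5
noise_std  = 0.8

for N in [1, 10, 100, 1000]:
    # Run the "average N readings" experiment 5000 times and look at
    # the spread of the resulting averages
    trials   = true_value + rng.normal(0, noise_std, size=(5_000, N))
    averages = trials.mean(axis=1)
    print(f"N={N:5d}  std of average = {averages.std():.4f}  "
          f"predicted sigma/sqrt(N) = {noise_std / np.sqrt(N):.4f}")
```

The measured spread tracks $\sigma/\sqrt{N}$ closely: every 100× increase in readings buys a 10× reduction in uncertainty.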

Classical signal processing generalizes averaging — low-pass filtering, Wiener filtering, Kalman filtering — all are sophisticated forms of “combine many noisy observations to reduce uncertainty.”

Two flavours of averaging matter in practice:

| | Ensemble average | Signal average (moving average) |
| --- | --- | --- |
| What you average | Many repeated trials at the same point | Neighbouring values across position or time |
| Assumption | Each trial has independent noise | Signal is locally smooth within the window |
| Practical limit | Need many repetitions of the same measurement | Blurs sharp edges and fine detail |
| Imaging example | Average 100 frames of the same scene | Slide a window across one frame, average pixels inside it |

In a lab you can often do ensemble averaging — hold everything still and repeat. In production imaging you rarely can (scene changes, one frame available), so signal averaging (spatial smoothing, Gaussian blur, moving average filter) is the practical tool. Wider window → more noise reduction but more blurring of real defect edges. Part II covers both in detail.
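The window-width trade-off is easy to see on a synthetic 1-D signal — a sketch with a sharp step standing in for a defect edge (the lengths and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# A step edge (like a defect boundary) buried in noise
signal = np.where(np.arange(200) < 100, 1.0, 3.0)
noisy  = signal + rng.normal(0, 0.5, size=200)

for window in [3, 11, 31]:
    kernel   = np.ones(window) / window                   # moving-average filter
    smoothed = np.convolve(noisy, kernel, mode='same')
    flat_err = (smoothed[20:80] - signal[20:80]).std()    # noise left on flat region
    print(f"window={window:3d}  residual noise on flat region = {flat_err:.3f}")
    # ...but the step at index 100 is now smeared over ~window samples
```

Residual noise falls roughly as $1/\sqrt{\text{window}}$, while the edge gets smeared over a span equal to the window — exactly the tension described above.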

MVTec example: average multiple exposures of the same tile patch to suppress sensor noise, then apply a smoothing filter to separate the slow-varying background texture from sharp defect edges. This reduces noise and makes defect structure more visible — but only tells us about patches we have already imaged. It gives no prediction for unseen surfaces.

Parts II and III of this book develop this attack for the imaging case: sampling, sensors, pixels, contrast, and why raw-pixel operations run into fundamental limitations.

Attack 2 — Parametric fitting (known model form)

Premise: I already know the functional form of ff from physics or from prior knowledge. I just don’t know a handful of constants.

Examples: a linear sensor calibration $y = \alpha x + \beta$ with unknown gain $\alpha$ and offset $\beta$, or an exponential decay $f(t) = A e^{-\lambda t}$ with unknown amplitude and rate.

Pick the constants that make the model best match the data. The standard criterion is least-squares: choose $\alpha$ and $\beta$ to minimise the total squared gap between each measurement $y_i$ and the model’s prediction $\alpha x_i + \beta$:

$$\min_{\alpha,\, \beta} \sum_{i=1}^{N} \bigl(y_i - \alpha x_i - \beta\bigr)^2$$

Intuitively: draw all possible lines through the scatter of points; the least-squares line is the one where the sum of the squared vertical distances from each point to the line is smallest. Squaring the gaps means large errors are penalised more heavily than small ones — a point twice as far from the line contributes four times the penalty. This is both a strength and a weakness: the fit responds strongly to every point, but a single outlier with a large gap pulls the line toward it to reduce that squared penalty. Least-squares is not outlier-robust.

The values of $\alpha$ and $\beta$ that achieve this minimum can be computed exactly from the data — no iteration needed. This is what classical statistics calls regression. You’re not discovering what $f$ looks like; you’re nailing down a few numbers inside a form that was handed to you by domain knowledge. The full derivation — why squaring, why this specific formula, and its connection to maximum likelihood under Gaussian noise — arrives in Chapter 7.
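Here is the closed-form solution in action on synthetic data — a sketch using the textbook formulas $\hat{\alpha} = \operatorname{cov}(x, y) / \operatorname{var}(x)$ and $\hat{\beta} = \bar{y} - \hat{\alpha}\bar{x}$ (the sensor numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy readings from a sensor whose true response is y = 2x + 0.5
alpha_true, beta_true = 2.0, 0.5
x = rng.uniform(0, 10, size=50)
y = alpha_true * x + beta_true + rng.normal(0, 0.8, size=50)

# Closed-form least-squares estimates — no iteration needed
alpha_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_hat  = y.mean() - alpha_hat * x.mean()

print(f"alpha: true {alpha_true}, estimated {alpha_hat:.3f}")
print(f"beta : true {beta_true}, estimated {beta_hat:.3f}")
```

With only 50 noisy readings the two constants are pinned down to a few percent — the power of committing to the right form.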

MVTec example: the MVTec tile surfaces are flat and the lighting is fixed, so Lambert’s cosine law simplifies to $y = \alpha r + \beta$ — a linear relationship between true reflectance $r$ and pixel value $y$. Fit $\alpha$ (lamp gain) and $\beta$ (dark current) from a set of defect-free calibration patches using least-squares. Any patch whose pixel values deviate significantly from this fitted model is flagged as a defect. The model generalises across the whole surface — but only because the flat-surface, fixed-lighting assumption holds. Change the lamp angle or surface curvature and the calibration breaks.

Attack 3 — Flexible learning (machine learning)

Premise: I don’t know the form of $f$. But I have many $(x, y)$ pairs and I’m willing to spend compute.

Pick a flexible hypothesis class — polynomials, kernels, neural networks, transformers — and find the member that best matches the data. You’re not committing to a specific form, just a space of forms. The algorithm chooses the form from the space.

MVTec example: the Tile category contains five defect types — cracks, glue strips, gray strokes, oil spots, and rough patches — each with a different visual signature. No single parametric model covers all of them. Instead, train a CNN on the labeled MVTec patches: the network learns which combinations of local texture, edge, and contrast cues predict defective — without anyone specifying those cues explicitly. Attack 3 wins here because the variety of defect appearances is too complex to write down as a formula, but the patterns are learnable from data.

Parts V (CNNs) and VI (attention, vision transformers, multimodal models) of this book develop Attack 3. They are, structurally, elaborate parametric-fitting problems — but with hypothesis classes flexible enough to learn the form of ff rather than inherit it.


6. Three Attacks on the Same Data

To make the three attacks tangible, consider a simple simulation. We invent a true underlying function:

$$f(x) = 1 + 0.5x + 1.2\sin(1.5x)$$

This is a mildly wavy curve — not a straight line, not wildly complicated. Think of it as the true reflectance profile of a surface as you slide a sensor across it. We then simulate 60 noisy measurements by sampling $x$ values uniformly between 0 and 6, computing the true $f(x)$ at each, and adding Gaussian noise:

$$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0,\ 0.4^2)$$

In a real experiment $f$ would be unknown. Here we keep it visible (dashed green line) so you can see how well each attack recovers it.
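The simulation takes only a few lines — a sketch of all three attacks on one dataset (the averaging window, polynomial degree, seed, and plotting details are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

true_f = lambda x: 1 + 0.5 * x + 1.2 * np.sin(1.5 * x)

x_data = rng.uniform(0, 6, size=60)
y_data = true_f(x_data) + rng.normal(0, 0.4, size=60)
x_plot = np.linspace(0, 6, 300)

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax in axes:
    ax.scatter(x_data, y_data, s=12, alpha=0.5, label='noisy data')
    ax.plot(x_plot, true_f(x_plot), 'g--', label='true f')
    ax.set_xlabel('x')

# Attack 1: average the readings that happen to fall near x = 3
near = np.abs(x_data - 3.0) < 0.5
axes[0].errorbar([3.0], [y_data[near].mean()], yerr=[y_data[near].std()],
                 fmt='o', capsize=4, label='local average')
axes[0].set_title('Attack 1: averaging at one x')

# Attack 2: least-squares straight line (wrong form -> misses the wiggle)
a, b = np.polyfit(x_data, y_data, 1)
axes[1].plot(x_plot, a * x_plot + b, label='linear fit')
axes[1].set_title('Attack 2: linear fit')

# Attack 3: degree-10 polynomial (flexible -> tracks the wiggle)
coeffs = np.polyfit(x_data, y_data, 10)
axes[2].plot(x_plot, np.polyval(coeffs, x_plot), label='degree-10 fit')
axes[2].set_title('Attack 3: flexible fit')

for ax in axes:
    ax.legend(fontsize=8)
```

Because the degree-10 class contains every straight line, its training residual is guaranteed to be at most that of the linear fit — the interesting question, taken up in Chapter 12, is whether that extra flexibility helps or hurts on new data.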

The same noisy dataset under three different attacks: averaging (a sharp point estimate at one x), linear fit (predicts everywhere but misses the wiggle), degree-10 polynomial (tracks the wiggle but risks overfitting).

What the figure shows:

No attack is universally right. The skill is matching the attack to what you already know about your problem. In practice you often combine them — average noisy readings first (Attack 1), fit a calibration curve to the averages (Attack 2), then pass the calibrated data to a neural network (Attack 3).

The mathematics behind each attack is built up across the book — why averaging reduces error in Attack 1 (Part I, probability), how the slope is derived from data in Attack 2 (Chapter 7, maximum likelihood), and how overfitting is detected and controlled in Attack 3 (Chapter 12, training). Each concept is introduced only when the tools to explain it properly are in place.


7. From 1D Signals to Images to Multimodal AI

Everything above used a scalar input $x$. Real problems are almost always higher-dimensional: an image is a grid of thousands of pixel values, a sentence is a sequence of tokens, an audio clip is a long waveform — the input becomes a vector $\mathbf{x}$, and often the output is a vector too.

The mathematics scales cleanly: wherever we wrote $y = f(x) + \epsilon$ for scalar $x$, we can write $y = f(\mathbf{x}) + \epsilon$ for vector $\mathbf{x}$, or $\mathbf{y} = f(\mathbf{x}) + \boldsymbol{\epsilon}$ for vector output. The three attacks stay the same. Visualization gets harder, the amount of data needed grows (the curse of dimensionality), and the algorithms become heavier — but the problem statement doesn’t change.
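To see that nothing in the machinery changes, here is the same least-squares attack with a two-dimensional input $\mathbf{x}$ — a sketch with invented coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Vector input: y = f(x1, x2) + noise, with f linear in both coordinates
n = 200
X = rng.uniform(0, 1, size=(n, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.5 + rng.normal(0, 0.1, size=n)

# Same least-squares idea, now in matrix form: columns [x1, x2, 1]
A = np.column_stack([X, np.ones(n)])
w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"true   coefficients: {np.append(w_true, 0.5)}")
print(f"fitted coefficients: {np.round(w_hat, 3)}")
```

The scalar formula from §5 is just the one-column special case of this matrix solve.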

This is why the book’s title is Signals to Transformers and not Pixels to Transformers: the framework subsumes pixels, tokens, audio, and video alike. A transformer processing a paragraph, a ViT processing an image, and a CLIP model fusing images with captions are all solving the same inverse problem — they just work in different signal spaces.


8. Signal Processing vs. Parametric Fitting vs. Machine Learning

The three attacks all start from the same data $\{(x_i, y_i)\}$ and the same goal — recover something useful about $f$. What separates them is how much you assume about $f$ before you look at the data, and what you get back in return.

| Dimension | Attack 1 — Signal Processing | Attack 2 — Parametric Fitting | Attack 3 — Machine Learning |
| --- | --- | --- | --- |
| Prior assumption about $f$ | None about its form. Only that the signal is repeatable or locally smooth. | The functional form of $f$ is known from physics / domain knowledge. Only a few constants are unknown. | The form of $f$ is unknown. You only commit to a flexible hypothesis class (polynomials, kernels, neural nets). |
| What you choose | A filter / averaging window | A small set of parameters $\theta$ (e.g. $\alpha, \beta$) | A hypothesis class + a learning algorithm |
| What you get back | A denoised estimate at the $x$ values you measured | An equation $f(x; \theta)$ valid everywhere the assumed form holds | A learned function $\hat{f}$ that maps any $x$ to a prediction |
| Generalises to new $x$? | No — only describes points you sampled | Yes — if the assumed form is correct | Yes — within the support of the training data |
| Data appetite | Modest: many readings at the same $x$ | Modest: few readings, but spread across $x$ | Large: many diverse $(x, y)$ pairs |
| Main failure mode | Over-smoothing destroys real edges / fine structure | Wrong functional form → systematically wrong predictions; no amount of data fixes it | Overfitting; poor extrapolation; opacity |
| What “fitting” means | Choosing kernel width, cutoff frequency | Solving for $\theta$ that minimises a loss (e.g. least-squares) | Solving for millions of weights that minimise a loss |
| Typical tools | Moving average, Gaussian blur, Wiener / Kalman filter, FFT | Linear regression, polynomial fit, exponential decay fit, calibration curves | CNNs, transformers, kernel methods, gradient boosting |
| Where taught | Signals & systems / DSP courses | Classical statistics / regression courses | ML / deep learning courses |
| MVTec tile example | Average frames + Gaussian blur to suppress sensor noise | Fit $y = \alpha r + \beta$ Lambert calibration; flag deviations as defects | Train a CNN on labelled defect patches across all five defect types |

How they relate

These are not three disjoint worlds — they sit on a spectrum of how much structure you bring to the problem:

$$\underbrace{\text{Signal processing}}_{\text{no model of } f} \;\longrightarrow\; \underbrace{\text{Parametric fitting}}_{\text{narrow, fixed model of } f} \;\longrightarrow\; \underbrace{\text{Machine learning}}_{\text{wide, learned model of } f}$$

What replaces the physics prior in ML?

A natural question: parametric fitting commits to a specific equation ($f(r) = \alpha r + \beta$, $f(t) = A e^{-\lambda t}$, etc.) handed down from physics. ML doesn’t. So what is its prior? It can’t be “anything goes” — that would never generalise.

The answer is that ML replaces the physics prior with a much weaker smoothness / regularity prior: nearby inputs should produce nearby outputs, $f$ shouldn’t wiggle wildly, the function should have some kind of structure that lets it be described by far fewer numbers than the training set contains.

Different ML models encode this weak prior in different ways:

| Model | Implicit prior on $f$ |
| --- | --- |
| Polynomial regression (degree $d$) | $f$ has bounded curvature up to order $d$ |
| Kernel methods (RBF, Gaussian process) | Nearby $x$ → nearby $f(x)$, controlled by a length-scale |
| CNN | Local patterns + translation invariance + hierarchical composition |
| Transformer | Long-range dependencies + permutation-equivariance over tokens |
| Neural net (generic) | Compositional smoothness — a small change in $x$ usually means a small change in $f$ |

So the contrast between Attack 2 and Attack 3 is really about where the information lives: in Attack 2 it lives in the equation — physics hands you the form, and the data only pins down a few constants; in Attack 3 it lives in the data, with only a weak smoothness prior steering the search.

This is also why regularisation matters in ML but barely appears in parametric fitting: when your hypothesis class is small and physics-shaped, the form itself is the regulariser. When the class is huge, you need explicit pressure (weight decay, dropout, early stopping, data augmentation) to keep the model inside the smooth, well-behaved region of the space.
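The effect of that explicit pressure can be demonstrated with ridge regression — least-squares plus weight decay — on polynomial features. A minimal sketch; the degree, sample count, and penalty strength are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# A few noisy samples of a smooth function
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)

# Degree-10 polynomial features: a huge hypothesis class for 15 points
X = np.vander(x, 11, increasing=True)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_free  = ridge_fit(X, y, 1e-10)   # essentially unregularised
w_ridge = ridge_fit(X, y, 1e-3)    # weight decay applied

# Regularisation shrinks the coefficients, keeping the fitted
# function inside the smooth region of the hypothesis space
print(f"coefficient norm, unregularised: {np.linalg.norm(w_free):.1f}")
print(f"coefficient norm, ridge        : {np.linalg.norm(w_ridge):.1f}")
```

The penalty term plays exactly the role the paragraph above describes: with almost no penalty the fit chases the noise with enormous coefficients; a small amount of weight decay collapses them back toward a smooth function.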

One observation the table doesn’t show directly: the three columns are not rivals. As §6 noted, practical pipelines routinely chain them — average first, calibrate next, learn last.


9. Summary

| Concept | Key idea |
| --- | --- |
| Measurement | $y = f(x) + \epsilon$ — signal plus noise |
| Noise | Physical, statistical, typically Gaussian (CLT) |
| Inverse problem | Recover $f$ from $(x, y)$ pairs |
| Attack 1 | Average / filter at known $x$ values (signal processing) |
| Attack 2 | Fit parameters inside a known functional form (regression) |
| Attack 3 | Pick a flexible hypothesis class, let data choose the form (ML) |
| Three-attack split | By premise and output, not by which course teaches them |
| Signals | Pixels, tokens, audio, video — the same math covers all |
| Running example | MVTec AD Tile: filter/average (A1), fit Lambert calibration model (A2), train CNN on labeled patches (A3) |

Next → Part I — Math Foundations if you want the probability and linear algebra before the applied content, or skip to Part II — Signals and Measurement for the first applied chapter.