The Smoothness Hypothesis

June 11, 2026

A Gimle Labs working paper — On Semantic Smoothness and the Stages of Learning. Read the full paper (PDF). It is preliminary and under revision; comments welcome.

Next-token prediction is remarkably good at language. Train a big enough transformer to predict the next token over enough text and you get grammar, facts, translation, reasoning. Intelligence, more or less, emerges from simply guessing the next token. Run the same recipe on the equations of dynamical systems, or on other physical phenomena, and it gets you almost nowhere useful. The model learns to predict the next token just fine, but it learns little about what the tokens mean: the dynamics the equations describe. Every domain runs the same training flow (prediction, then reinforcement learning, then search), but language gets a long way on prediction alone, whereas dynamical systems and proteins have to reach for the later, heavier stages almost immediately to become useful at all.

So why does prediction carry you so far in one domain and so little in another? My explanation rests on a single property, smoothness: how smooth the map from syntax to meaning is, or equivalently how far meaning moves when you change one token. Smoothness determines how far prediction alone can take you before the later stages have to take over.

A single property: smoothness

Think of a domain as a syntactic space \(\mathcal{S}\) of token sequences, a semantic space \(\mathcal{M}\) of meanings or behaviours, and a map \(\phi : \mathcal{S} \to \mathcal{M}\) that sends an expression to what it means. In language, most small edits to a sequence barely move its meaning — replace “cat” with “dog” and the sense hardly shifts. The map is smooth. In dynamical systems, a single changed coefficient can turn a stable equilibrium into chaos. The map is sensitive: nearby points in token space can be arbitrarily far apart in behavioural space.

Because \(\mathcal{S}\) is discrete, we measure this with perturbation sensitivity: the worst-case change in meaning under a single-token edit, averaged over the expressions a domain actually produces. Writing \(\mathcal{N}(s)\) for the edit-distance-one neighbourhood of \(s\),

\[ \kappa(s) = \sup_{s' \in \mathcal{N}(s)} d_\mathcal{M}\big(\phi(s), \phi(s')\big) \qquad\qquad \bar{\kappa} = \mathbb{E}_{s \sim p(\mathcal{S})}\big[\kappa(s)\big] \]

where \(p(\mathcal{S})\) is the natural distribution over expressions. For language, \(d_\mathcal{M}\) might be cosine distance in a sentence-embedding space; for dynamical systems, the \(L^2\) divergence between simulated trajectories.
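To make the definition concrete, here is a minimal sketch of \(\kappa(s)\) and \(\bar{\kappa}\) on a toy domain. Everything in it is an assumption chosen for illustration, not the paper's setup: expressions are three-digit strings, \(\phi\) reads them as integers, \(d_\mathcal{M}\) is absolute difference, the neighbourhood is restricted to substitutions (insertions and deletions are left out), and a uniform corpus stands in for \(p(\mathcal{S})\).

```python
from itertools import product
from statistics import mean

DIGITS = "0123456789"

def neighbours(s):
    """Single-token substitutions of s (a subset of the edit-distance-one ball)."""
    for i, old in enumerate(s):
        for new in DIGITS:
            if new != old:
                yield s[:i] + new + s[i + 1:]

def phi(s):
    """Toy semantics: the integer a digit string denotes."""
    return int(s)

def d_M(a, b):
    """Toy semantic metric: absolute difference between meanings."""
    return abs(a - b)

def kappa(s):
    """Worst-case change in meaning under one single-token edit of s."""
    return max(d_M(phi(s), phi(t)) for t in neighbours(s))

def kappa_bar(expressions):
    """Average of kappa over a sample standing in for p(S)."""
    return mean(kappa(s) for s in expressions)

# Uniform stand-in for the natural distribution over expressions.
corpus = ["".join(p) for p in product(DIGITS, repeat=3)]
print(kappa("123"), kappa_bar(corpus))
```

In this toy the leading digit dominates: the worst edit of "123" rewrites the hundreds place, so the sensitivity is set by the most load-bearing token rather than by the typical one.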

A single-token edit-neighbourhood and its image under \(\phi\). When the map is smooth the neighbourhood stays a tight cluster in meaning-space; when it is sensitive the same neighbourhood scatters. \(\bar{\kappa}\) is the average spread.

The hypothesis is that \(\bar{\kappa}\) governs how high a ceiling next-token prediction (NTP) can reach in a domain. When \(\bar{\kappa}\) is small, token prediction is a faithful proxy for meaning and its ceiling is high: most of the available understanding can be learned by predicting tokens alone. As \(\bar{\kappa}\) grows, that ceiling falls and NTP’s returns run out earlier, so a larger share of the understanding has to come from paradigms whose signal is evaluated directly in semantic space: reinforcement learning, search, or constrained generation.

Why smoothness lets prediction learn meaning

Training a model to predict the next token minimises cross-entropy over the token distribution — which is the same thing as maximising compression. The question is whether compressing the syntax forces the model to learn the semantics, and smoothness is exactly the condition under which it does.
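The link to compression can be written out. In standard notation (the symbols \(p_\theta\), \(x_t\) and \(T\) are introduced here for illustration and are not taken from the paper), the next-token loss measured in bits is

\[ \mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim p}\Big[\sum_{t=1}^{T} \log_2 p_\theta(x_t \mid x_{<t})\Big] \]

which is also, up to a small constant overhead per sequence, the expected length of the code an arithmetic coder driven by \(p_\theta\) would emit. Lowering the loss and shortening the code are therefore the same act.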

If \(\bar{\kappa}\) is small, every single-token edit leaves meaning roughly intact, so syntactic neighbourhoods are semantically coherent. The token-level loss then becomes a faithful proxy for semantic quality, and each example carries information about its whole edit-neighbourhood — sample efficiency rises as \(\bar{\kappa}\) falls. More than that, the regularities that make the data compressible are the semantic regularities: a model cannot discover that “mat” and “rug” are interchangeable without learning that they mean similar things, because the small \(\bar{\kappa}\) makes that similarity real rather than coincidental. Compression doesn’t merely correlate with meaning here — it is forced to internalise it.

When \(\phi\) is non-smooth, the coupling breaks. Two equations differing by one coefficient can behave entirely differently, so syntactic proximity guarantees nothing about meaning. A model can reach excellent perplexity on equation strings by learning operator frequencies and notational habits while learning nothing about the dynamics — compression and understanding decouple.

\(\bar{\kappa}\) is not intrinsic to a domain: it depends on the tokenisation, on the semantic metric, and on the distribution of expressions. A byte-pair tokenisation of equations has a different sensitivity profile than a tree-structured one. So lowering \(\bar{\kappa}\) is a concrete engineering target — finding tokenisations, embeddings, or compositional languages that smooth the map. Representation engineering is the search for representations that minimise \(\bar{\kappa}\).
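A small sketch of that dependence on representation, under toy assumptions that are not from the paper: the same eight integers are written once in a three-bit positional code and once in a seven-token thermometer code, and the sensitivity to a single-token flip is compared. The function names and both encodings are hypothetical choices for illustration.

```python
from statistics import mean

def flips(bits):
    """All strings reachable by flipping exactly one binary token."""
    for i, b in enumerate(bits):
        yield bits[:i] + ("1" if b == "0" else "0") + bits[i + 1:]

def kappa_bar(encode, decode, values):
    """Mean worst-case change in decoded value under a single-token flip."""
    def kappa(v):
        s = encode(v)
        return max(abs(decode(t) - v) for t in flips(s))
    return mean(kappa(v) for v in values)

values = range(8)

# Positional (binary) code: flipping the leading bit moves the value by 4.
binary = kappa_bar(lambda v: format(v, "03b"), lambda s: int(s, 2), values)

# Thermometer code: v ones then zeros, decoded by counting the ones.
thermo = kappa_bar(lambda v: "1" * v + "0" * (7 - v), lambda s: s.count("1"), values)

print(binary, thermo)
```

The meanings are identical in both cases; only the encoding changed. The smoother code pays for its smoothness with redundancy, seven tokens where three would do.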

Sensitivity across domains

Ranking domains by \(\bar{\kappa}\) lines up with where their learning has actually come from.

Domain	\(\bar{\kappa}\)	NTP	Fine-tuning	RL / search
Natural language	Low	High	Low	Low
Audio / video	Low–Med	High	Med	Low
Code	Medium	Med	Med	Med
Theorem proving	High	Low	Med	High
Protein structure	High	Low	High	Med
Dynamical systems	Very high	Low	Med	High

The cells are the share of the attainable quality each stage contributes, not a verdict on which stages a domain uses. (Every domain uses all of them; more on that below.) Natural language sits at low \(\bar{\kappa}\): most single-token substitutions yield near-synonyms or detectable errors, and the exceptions, such as negation and quantifiers, are statistically rare. This is no accident; languages evolved for robust communication over noisy channels. Video and audio inherit low-to-moderate sensitivity from the continuous physical processes that produce them. Dynamical systems sit at the far end: in the Lorenz system with the standard \(\sigma = 10\), \(\beta = 8/3\), raising \(\rho\) from \(24.05\) to just past \(24.74\) takes the long-run behaviour from settling onto a stable fixed point to chaos, and the Lipschitz constant of \(\phi\) is unbounded near bifurcation points, which are dense in the space of systems. Protein sequences are similarly rugged (single substitutions can cause misfolding, and epistasis makes effects context-dependent), which is why AlphaFold needed geometric supervision rather than sequence prediction alone.
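The figure of \(24.74\) in the Lorenz example can be checked by hand. Assuming the classical parameter values \(\sigma = 10\) and \(\beta = 8/3\), the two non-trivial fixed points lose stability in a Hopf bifurcation at

\[ \rho_H = \frac{\sigma(\sigma + \beta + 3)}{\sigma - \beta - 1} = \frac{10\,(10 + 8/3 + 3)}{10 - 8/3 - 1} = \frac{470}{19} \approx 24.74 \]

so an edit to \(\rho\) in the second decimal place is enough to move the system across the boundary.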

Two things are worth noticing. First, vocabulary size loosely tracks smoothness — thousands of tokens for language, tens of amino acids, a handful of operators for symbolic dynamics — but size is a proxy for redundancy, not a cause. A large, synonym-rich vocabulary shaped by noisy-channel pressure (printed English is more than half redundant) lets most single-token swaps land on a near-neighbour in meaning, whereas a small vocabulary makes every symbol load-bearing. Second, language’s low \(\bar{\kappa}\) is itself a clue: language is a projection of the world through human cognition, a discrete encoding that preserves the smoothness of the structure it describes — which is why inverting that smooth map with token prediction recovers not just word meanings but the world-structure behind them.

One pipeline, domain-dependent ceilings

Modern models are built in stages: pre-training by next-token prediction, post-training by reinforcement learning, and inference-time search. It is tempting to read the smoothness hypothesis as “smooth domains use NTP, rough domains use search.” That is the wrong picture. Every domain runs the whole pipeline; \(\bar{\kappa}\) sets where the quality is gained, not which stages are used.

Why can the later stages keep climbing where token prediction stalls? It comes down to where each one’s learning signal is computed. NTP’s gradient points toward better token prediction; when \(\phi\) is smooth that direction also improves meaning, but when \(\phi\) is rough the two come apart — an update that makes an equation string more likely can make the dynamics it describes worse, because near a bifurcation the link between syntactic likelihood and behaviour is arbitrary. Gradient descent in token space is then semantically blind. Reinforcement learning sidesteps this by computing its signal in meaning-space instead: it samples whole outputs, scores each by running the semantics — simulate the equation, fold the protein, play the game to the end — and reweights the generation probability by that score. It never differentiates through \(\phi\), so it needs \(\phi\) only evaluable, not smooth. (Differentiable simulation can supply gradients through \(\phi\) directly, but only where \(\phi\) is locally smooth — the same constraint that limits NTP; sampling-based RL’s advantage is precisely that it avoids the requirement.) Search goes one step further, exploring the output space globally under a learned value function that approximates the rough map from experience — robust even when neighbouring outputs have wildly different value.
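The sampling-based update described above can be sketched in a few lines. This is a toy under stated assumptions, not the paper's training setup: the "expressions" are four hypothetical values of the coefficient \(r\) in the logistic map, \(\phi\) is a black-box simulation of the late-time orbit, the score \(q\) is 1 when the orbit has settled to a fixed point, and the policy is a softmax over the four candidates updated with a plain score-function (REINFORCE) rule and no baseline.

```python
import math
import random

CANDIDATES = [2.5, 3.2, 3.5, 3.9]  # hypothetical one-coefficient "expressions"

def phi(r, burn_in=500, keep=50):
    """Black-box semantics: the late-time orbit of x -> r * x * (1 - x)."""
    x = 0.2
    for _ in range(burn_in):
        x = r * x * (1 - x)
    orbit = []
    for _ in range(keep):
        x = r * x * (1 - x)
        orbit.append(x)
    return orbit

def q(orbit):
    """Score in meaning-space: 1 if the orbit has settled to a fixed point."""
    return 1.0 if max(orbit) - min(orbit) < 1e-6 else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    for _ in range(steps):
        probs = softmax(logits)
        k = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        reward = q(phi(CANDIDATES[k]))  # evaluate phi; never differentiate it
        # Score-function update: reward * grad of log pi(k) w.r.t. the logits.
        for j in range(len(logits)):
            logits[j] += lr * reward * ((1.0 if j == k else 0.0) - probs[j])
    return softmax(logits)

probs = train()
print([round(p, 3) for p in probs])
```

The simulator is only ever called, never differentiated, which is why the period-doubling and chaotic neighbours of the rewarded coefficient cause no trouble: the update needs the score of each sample and nothing about how the score changes between samples.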

It helps to see the stages as a spectrum, ordered by how much of the semantic evaluation each one internalises:

Next-token prediction — the signal is purely syntactic. It needs \(\phi\) smooth for semantic learning to happen as a by-product of compression.
Behavioural fine-tuning — adds a semantic loss (simulation error, structural accuracy) but still optimises by gradient through the model, so it needs \(\phi\) at least locally smooth near good solutions.
RL with a semantic reward — the signal originates in meaning, estimated by sampling. It needs \(\phi\) only evaluable, not smooth — and so handles rough maps, though it struggles with sparse or deceptive rewards.
Search with a learned value function — explores the output space globally under a learned evaluator. Maximally robust to non-smoothness. This is the AlphaGo paradigm.

At the search end there is a further twist — self-play. Instead of learning from a fixed corpus, the system generates its own signal by playing against itself, and as it improves it manufactures ever-harder cases: an adaptive curriculum in meaning-space. That matters most exactly where \(\bar{\kappa}\) is high, because the space of behaviours is then vast and fractally bounded, and no fixed dataset can cover the regions that decide competence — self-play, or its analogue of iterative refinement against a simulator, concentrates learning where the model currently fails.

Each stage has a ceiling: the best semantic quality its objective can reach. If \(\pi_\theta\) is the model’s distribution over expressions and \(q\) scores the meaning of each one, semantic quality is just the expected score,

\[ Q(\theta) = \mathbb{E}_{s \sim \pi_\theta}\big[\, q(\phi(s)) \,\big]. \]

Reinforcement learning optimises \(Q\) directly, with reward \(R = q \circ \phi\). Next-token prediction does not — it minimises a syntactic surrogate and gets only the \(Q\) that surrogate happens to carry, and smoothness sets how much that is. Within a stage, \(Q\) climbs along diminishing returns; you move to the next stage when the marginal return drops below what the next one offers. In low-\(\bar{\kappa}\) domains NTP’s ceiling is high and most of the understanding is bought in pre-training; in high-\(\bar{\kappa}\) domains it is low, and the rest must come from the later, semantically grounded stages.
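Why optimising \(Q\) directly needs \(\phi\) only to be evaluable follows from the standard score-function identity, sketched here for a discrete \(\mathcal{S}\):

\[ \nabla_\theta Q(\theta) = \nabla_\theta \sum_{s} \pi_\theta(s)\, q(\phi(s)) = \sum_{s} \pi_\theta(s)\, \nabla_\theta \log \pi_\theta(s)\, q(\phi(s)) = \mathbb{E}_{s \sim \pi_\theta}\big[\, q(\phi(s))\, \nabla_\theta \log \pi_\theta(s) \,\big] \]

The derivative falls on \(\log \pi_\theta\) alone. The term \(q(\phi(s))\) enters as a number obtained by running the semantics, so no smoothness of \(\phi\) is assumed anywhere in the estimate.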

Every domain traverses all three stages; the slope of each arc is the marginal return of more budget. A low-\(\bar{\kappa}\) domain (language) banks most of its semantic quality in pre-training; a high-\(\bar{\kappa}\) domain (dynamical systems) gains little from token prediction and draws its competence from the later, semantically grounded stages.

The ordering is forced, not merely economic. Reinforcement learning only reweights probability mass the base policy has already placed on good outputs, so until pre-training has shaped the distribution reasonably well, the sampled reward is near-zero almost everywhere and the policy gradient is pure variance. Below that threshold RL doesn’t so much underperform NTP as fail to function — it refines a distribution that NTP must first create.
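A back-of-envelope version of that threshold, with symbols introduced here for illustration: let \(p\) be the total probability the base policy places on rewarded outputs and \(N\) the number of samples in a batch. Then

\[ \Pr[\text{at least one rewarded sample}] = 1 - (1 - p)^N \le Np \]

With purely illustrative numbers \(p = 10^{-9}\) and \(N = 10^{3}\), at most about one batch in a million contains any reward at all, and every other batch contributes a zero update.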

Evidence from within language

The history of large language models is itself evidence for the spectrum, because language is not uniformly smooth and its development has traced exactly the paradigm progression the hypothesis predicts.

The NTP era (GPT-2) delivered fluent text, coherent paragraphs, and basic factual recall — the smooth regions of language, where most substitutions preserve meaning and compression is a reliable proxy for understanding.
The RLHF era (InstructGPT, ChatGPT) addressed a less smooth region. “Summarise this” and “don’t summarise this” differ by one token but demand opposite behaviour; hedging and scope can flip meaning while barely moving the token distribution. NTP alone proved insufficient, and RLHF supplied a signal rooted in human preference rather than token prediction.
The reasoning era (o1, o3, extended thinking) meets the least smooth region of all. A single wrong step invalidates a multi-step argument, so the frontier models sample many paths, score intermediate steps with process reward models, and backtrack: structurally akin to Monte Carlo tree search (MCTS), a model exploring a tree of continuations under a learned value.

There is a tempting conjecture in here: chain-of-thought may work by lowering effective sensitivity. Instead of mapping question to answer in one rough step, the model takes intermediate steps, each locally smoother than the whole. The caveat is real, though. If step \(i\) has sensitivity \(\kappa_i\), the composed sensitivity is only bounded above by \(\prod_i \kappa_i\), and that bound can exceed the sensitivity of the direct map. Decomposition helps only when the steps genuinely factor the problem into smooth sub-problems, rather than spreading the non-smoothness over more of them.
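The product bound is the usual composition rule, if each \(\kappa_i\) is read as a Lipschitz constant of a step map \(f_i\) (an assumption made here for the sketch; the definition earlier in the text is a worst-case change, not a ratio). For two steps,

\[ d\big(f_2(f_1(x)),\, f_2(f_1(x'))\big) \le \kappa_2\, d\big(f_1(x),\, f_1(x')\big) \le \kappa_2\, \kappa_1\, d(x, x') \]

and the same argument chains over any number of steps. Ten steps with \(\kappa_i = 1.5\) give a bound of \(1.5^{10} \approx 57.7\), while ten steps with \(\kappa_i = 0.9\) give \(0.9^{10} \approx 0.35\): a chain damps sensitivity only when its individual steps do.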

The horizon of the learning signal

The three stages differ in another way: the unit of output that gets scored before it drives an update. Next-token prediction is supervised one token at a time; reinforcement learning scores a whole completion; inference-time search evaluates an entire rollout. The stack is a ladder of increasing evaluation horizon: token → completion → rollout.

The learning signal's horizon grows up the stack: NTP scores one token, RL a whole completion, search a full rollout. \(\bar{\kappa}\) sets the minimum horizon at which the signal carries semantic information.

Smoothness fixes how far up that ladder a domain must climb. When \(\phi\) is smooth the semantic consequence of a token is locally visible, so a token-level signal already means something and the shortest horizon suffices. When \(\phi\) is rough, a local choice’s effect surfaces only downstream — after the bifurcation resolves, the proof completes, the game ends — so a token-level signal is semantically empty and has to be integrated over a longer horizon to mean anything. \(\bar{\kappa}\) therefore sets the minimum horizon at which supervision becomes informative, and efficiency holds a domain at that minimum, because a longer horizon buys validity only at the price of sparser, higher-variance signals and harder credit assignment. The price is computational as well as statistical: each rung returns its signal only after a longer output is generated, so a fixed compute budget buys far fewer updates at the rollout end.

The 90/10 rule for language

Language runs the whole ladder too, and the way it does makes the stakes vivid. Its average \(\bar{\kappa}\) is low, so next-token prediction reaches a high ceiling and banks the bulk of the model’s competence cheaply — in an information sense, pre-training is the language model and the later stages only refine it. But language has a rough tail — multi-step reasoning, long-range coherence, instruction nuance — and that tail is exactly what the expensive later stages exist to chase.

The cost–value inversion (illustrative). Most of the capability mass is banked cheaply by pre-training — the steep early rise. The long, expensive tail (RL and test-time-compute rollouts) buys the last increment, small in magnitude but the part that crosses the threshold of human usefulness.

The result is an inversion of cost and value. As a rough heuristic, the easy first 10% of the effort buys some 90% of the capability mass, while the hard remaining 90% — reinforcement learning and the long, compute-intensive rollouts now sold as “test-time compute” — buys the last 10%. That last tenth is small in magnitude but decisive in value, because it is the part that crosses the threshold of human usefulness: the difference between a model that is merely fluent and one that reads as intelligent. “Pre-training is the core” and “the real progress is in the rollouts” are therefore not in tension — they describe the bulk and the tail of the same within-language \(\bar{\kappa}\) distribution. And the tail is costly for the reason above: being rough, it demands the longest evaluation horizon, so the price of language’s last mile is, quite literally, the price of long-horizon supervision.

Can we measure it?

The hypothesis is only useful if \(\bar{\kappa}\) can be estimated, and the cleanest case is where the semantics is computable.

For a symbolic dynamical system the map \(\phi\) is a simulator: feed it an equation and it returns a trajectory. Estimating \(\bar{\kappa}\) then needs no trained model — only a way to edit the equation and a way to integrate it. Nudge a single coefficient, simulate the original and the edited equation from the same initial condition, and take the worst-case relative divergence between trajectories. Done across a bank of canonical systems and along parameter sweeps, the measured \(\bar{\kappa}\) cleanly separates the smooth systems from the rough ones, rises through chaotic regimes, and also spikes at bifurcation boundaries — the two distinct routes by which a single edit changes meaning. And it has predictive bite: as \(\bar{\kappa}\) rises, the horizon over which a fitted model stays accurate falls. A near-perfect local fit doesn’t buy long-horizon competence once \(\bar{\kappa}\) is high — the symbolic-to-behavioural ceiling made concrete. (The numbers are in the paper.)
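The procedure in this paragraph can be sketched for a single system. The following is a minimal illustration, not the paper's measurement code: it uses the Lorenz equations with the classical \(\sigma = 10\), \(\beta = 8/3\), a hand-written Runge-Kutta integrator, a nudge of \(0.01\) to \(\rho\) as the single-coefficient edit, and a normalisation by the largest norm of the unedited trajectory as one possible reading of "relative divergence". All of those choices are assumptions.

```python
import math

def lorenz(state, rho, sigma=10.0, beta=8.0 / 3.0):
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(state, rho, dt):
    """One classical Runge-Kutta step of the Lorenz equations."""
    def shifted(k, h):
        return tuple(s + h * ki for s, ki in zip(state, k))
    k1 = lorenz(state, rho)
    k2 = lorenz(shifted(k1, dt / 2), rho)
    k3 = lorenz(shifted(k2, dt / 2), rho)
    k4 = lorenz(shifted(k3, dt), rho)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def trajectory(rho, steps=4000, dt=0.01, start=(1.0, 1.0, 1.0)):
    states = [start]
    for _ in range(steps):
        states.append(rk4_step(states[-1], rho, dt))
    return states

def edit_divergence(rho, delta=0.01):
    """Worst-case relative gap between trajectories for rho and rho + delta."""
    base, edited = trajectory(rho), trajectory(rho + delta)
    scale = max(math.dist(p, (0.0, 0.0, 0.0)) for p in base)
    return max(math.dist(p, q) for p, q in zip(base, edited)) / scale

smooth = edit_divergence(10.0)  # both runs settle onto a nearby fixed point
rough = edit_divergence(28.0)   # both runs are chaotic and decorrelate
print(f"{smooth:.4f} {rough:.4f}")
```

The same edit of the same size is nearly invisible at \(\rho = 10\) and grows to the scale of the attractor at \(\rho = 28\). Averaging such values over a bank of systems and parameter sweeps is what turns this per-expression number into an estimate of \(\bar{\kappa}\).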

Language is where measurement turns hard. There is no closed-form \(\phi\), so \(\bar{\kappa}\) has to be read off a model — replace a token with a plausible alternative and measure the shift in a sentence-embedding space, say. The mechanics behave sensibly: meaning-preserving edits move the embedding less than random ones. But the absolute number is treacherous. A generic sentence embedder reports mathematical text as smoother than prose — it scores “5” → “7” as a tiny change, encoding topic and surface rather than the answer that actually flips. The embedder is a smooth-but-wrong semantic metric, exactly as next-token prediction is a smooth-but-wrong objective: the syntax–semantics gap reappears in the very instrument meant to measure it. A language \(\bar{\kappa}\) is only meaningful against a metric that captures the target semantics, and constructing one is the open problem.
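The failure mode is easy to reproduce with a deliberately crude stand-in. The sketch below is a toy, not a real sentence embedder: `embed` is a hypothetical bag-of-tokens count vector and `answer_distance` is a hypothetical task-aware metric that looks only at the stated answer. It shows the mechanism the paragraph describes, a metric that tracks surface overlap while the meaning that matters flips.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy stand-in for a sentence embedder: a bag-of-tokens count vector."""
    return Counter(sentence.split())

def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def answer_distance(s, t):
    """Task-aware metric: 0 if the stated answer (last token) matches, else 1."""
    return 0.0 if s.split()[-1] == t.split()[-1] else 1.0

original = "the answer to 2 + 3 is 5"
edited = "the answer to 2 + 3 is 7"

surface = cosine_distance(embed(original), embed(edited))
task = answer_distance(original, edited)
print(surface, task)
```

Seven of the eight tokens are shared, so the surface metric reports a small move, while the task-aware metric reports the largest move it can. Which of the two numbers feeds \(\bar{\kappa}\) decides whether mathematical text looks smooth or rough.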

Note: Some early measurement results are in the working paper; they are still being finalised.

What it means

A popular view holds that compression and intelligence are the same thing: a model that compresses data well must understand its structure. In its strong form that claim is about Kolmogorov complexity — the shortest program reproducing the data — which captures all computable structure, semantics included, regardless of smoothness. The smoothness hypothesis makes a narrower point about NTP as a specific, bounded compressor: cross-entropy minimised locally in token space by gradient descent. When \(\phi\) is smooth, that procedure is forced to find semantic regularities, because syntactic patterns are semantic ones. When \(\phi\) is rough, the two decouple — a Kolmogorov-optimal compressor would still recover the meaning, but NTP, local in token space, captures only the syntax. That is why scaling protein language models improves sequence metrics without reaching AlphaFold-level structure prediction: the structural understanding sits above NTP’s ceiling and never arrives.

It also nuances Sutton’s “bitter lesson” — the claim that general methods leveraging computation ultimately beat methods built on human domain knowledge. Scaling is the act of pushing further along the first stage’s learning curve. In smooth domains that curve climbs to a high ceiling, so scale alone suffices and the bitter lesson holds in its strongest form. In rough domains the same curve flattens early, so scaling stalls and representation and structure matter — the right inductive biases, the right search, the right compositional language become prerequisites, not luxuries. The lesson is not equally bitter for all domains.

Which is the practical point. For structure-sensitive domains the path forward is not simply scaling autoregressive models; it is direct semantic evaluation, structured search, and — most of all — representations engineered to make the syntax-to-meaning map as smooth as possible. Finding the right language for a domain is not a convenience. It is a prerequisite for learning, and it is exactly the bet behind Gimle’s explicit, compositional approach — Asgard gives dynamical systems a typed language whose structure stays explicit from equation to execution, which is what lowering \(\bar{\kappa}\) looks like in practice.

Read the full paper (PDF) — the complete treatment, including the formal development and the measurements of \(\bar{\kappa}\).
Asgard: a programming language for dynamical systems — the representation-engineering side of the same idea.

eriksfunhouse.com
