curriculum

return


the fall of 2025
machine learning works


the core finding:

Language modeling is weird. The last 30 years of regular expressions and computational linguistics have insisted, quite firmly, that human languages are not only orderly and rule-governed, but that those rules are intrinsically challenging, averse to representation, and require decades of monastic study (e.g. to write a successful SQL or regex query in an intentional way). Yet in the 2020s it has become increasingly popular to assert that there is a 'second face' to language. Under that glowering gaze, extremely expensive and fancy computers which a normal person cannot understand (or at least ought politely to pretend not to) may emit stunning pablum such as 'king plus queen equals double monarch; no, that's not it, it's prince plus prince equals true love'. This is the face of 'generators' rather than 'classifiers' or 'translators'.

What challenges us as residents of the present is whether computational linguistics is hard because linguistics is hard, whether machine learning programs live in the grandest computers ever built out of absolute minimum necessity (perhaps bigger computers are different by bigness alone, and special understandings are available only to the biggest around), or whether, perhaps, we have been Gettier-cased on both topics.

The Case For The Gettier Case: Models learn representations of their training task: the BPE token, the character, the 1/75th of a second of amplitudes. But models learn representations of absolutely everything else in their contexts, with no evidence of 'double descent', 'grokking', 'overparameterization regimes'. While loss descends, the model is learning. While the model is learning, the model, it seems, is learning everything in the training context. Abstraction, generalization, and reasoning appear earlier than expected.

summer 2025:
reinforcement learning

case study:

source:

What happened: Applied GRPO to Qwen2.5- and Qwen3-family models at 0.6B, 1.5B, and 7B scale on verified math benchmark tasks, using willcb's verifiers repository.

What the models learned: Given a boring task, the largest policy models immediately saturated their rewards and developed no surprising or interesting behaviors as policies. The smallest models, however, developed the far more novel strategy of writing `{"name":"python", "args":{"code":"import numpy as np; print(np.array([1,2,3])+np.array([4,5,6]))"}}` in response to every math problem during policy rollouts.

Why this happened: The reward weighting, combined with the delicate in-context-learning capabilities of the 0.6B models, inadvertently made 'use Python five times in a multi-turn sequence' (5 points) far easier to discover than any policy that submits correct answers (1 point, plus 5 points if the model also makes plenty of tool calls along the way). The model correctly optimized its actual objective, which was both more complicated and more behaviorally broad than the apparent objective.
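A minimal sketch of that reward shape (the point values match the description above, but the string-transcript framing, the call-counting heuristic, and the function names are illustrative assumptions, not the run's actual configuration):

```python
def count_python_calls(rollout: str) -> int:
    """Crude count of python tool invocations in a rollout transcript (illustrative only)."""
    return rollout.count('"name":"python"') + rollout.count('"name": "python"')

def reward(rollout: str, answer_is_correct: bool) -> float:
    """Illustrative reward shape: up to 5 points for merely invoking the python tool
    across the multi-turn rollout, versus a single point for a correct final answer."""
    tool_points = float(min(count_python_calls(rollout), 5))
    answer_points = 1.0 if answer_is_correct else 0.0
    return tool_points + answer_points
```

Under this shape, a policy that discovers the tool-call string once can fill its rollouts with it and collect most of the available reward without ever answering correctly.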

Why this mattered: If training models too small to learn the 'math benchmark output format' before they learned the 'python tool-calling reward' had led to damaged generalization or the loss of almost all prior knowledge or behavior, that would have been a failure. Instead, the tiny models with saturated function-calling rewards and lax attitudes toward well-formatted answers demonstrate an amazing flexibility of output behavior, treating new and unseen text contexts as opportunities to query new APIs which might earn them new and unseen format rewards, such as a hypothesized `{" url ": " https :// api . example . com / search " , " query ": " capital of france "}` never presented as training data nor rewarded during policy optimization. The absence of total mode collapse and total 'neural text degeneration' at eval time, even in models trained on extremely repetitive and 'narrow' text distributions, casts a strange light on the 'stochastic parrot' thesis of the 2020s, and the eerie shadows it reveals lead far away from the 'random, too large, unexplainable' analysis of models of language.

fall 2025:
char-level models

release notes & logs:

source:

What happened: Returned to character-level modeling (initially for FFN sparsification experiments). Built custom training infrastructure with online rollout capture—the ability to generate long autoregressive samples during training and measure their attributes in the same manner as output table evaluations—without checkpoint persistence.
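A minimal sketch of that kind of online rollout capture, assuming a PyTorch-style model with a `generate` method and a `decode` helper (both hypothetical names standing in for the project's actual infrastructure):

```python
import torch

@torch.no_grad()
def capture_rollout(model, decode, prompt_ids, max_new_tokens=512, temperature=1.0):
    """Sample a long autoregressive rollout from the live ('hot') training model,
    with no checkpoint persistence, and return it with a few measurable attributes."""
    was_training = model.training
    model.eval()
    out = model.generate(prompt_ids, max_new_tokens=max_new_tokens, temperature=temperature)
    text = decode(out[0].tolist())
    if was_training:
        model.train()
    # Measure the rollout the same way the fixed eval tables are measured.
    return {"text": text, "length": len(text), "unique_chars": len(set(text))}

# Inside the training loop, every `rollout_every` steps:
#   if step % rollout_every == 0:
#       metrics = capture_rollout(model, decode, prompt_ids)
#       logger.log(step=step, **metrics)   # logger is whatever logging sink is in use
```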

What the models learned: Character-level GPT models trained on TinyStories reached a per-character perplexity of 2.13 within 4,500 steps from random initialization: not a meaningless, vast error when compared on an equal footing against a scale-matched BPE-tokenized baseline (see the conversion sketch below). More importantly, they showed observable learning stages:
- 0-300 steps: Random noise → grammatical sentences with nonsense words
- 300-600 steps: Plausible misspellings, basic story structure
- 600+ steps: Coherent model-scale-appropriate narratives with occasional character-level artifacts
The progression: phonographic patterns → stress patterns → large n-gram patterns → syntax → semantics.
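To compare a per-character perplexity against a BPE-tokenized baseline on an equal footing, both can be converted to the same unit, e.g. nats per character; the BPE figures below are illustrative placeholders, not measurements from these runs:

```python
import math

def nats_per_char_from_char_ppl(char_ppl: float) -> float:
    """Cross-entropy of a character-level model, in nats per character."""
    return math.log(char_ppl)

def nats_per_char_from_bpe_ppl(bpe_ppl: float, chars_per_token: float) -> float:
    """A BPE model's per-token cross-entropy, spread over the characters each token covers."""
    return math.log(bpe_ppl) / chars_per_token

# Illustrative comparison: a char-level model at 2.13 ppl vs. a hypothetical BPE
# model at 12.0 ppl whose tokens average 3.5 characters.
print(nats_per_char_from_char_ppl(2.13))      # ~0.756 nats/char
print(nats_per_char_from_bpe_ppl(12.0, 3.5))  # ~0.710 nats/char
```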

Novel evaluation framework: Measuring this properly required writing extremely nonstandard training, inference, evaluation, and logging code, plus a gentle sprinkling of standard, prosaic stats code. At the time of this research inquiry, autoregressive/rollout evaluation of 'hot' models during a training schedule was a theoretical topic of excited but extremely narrow interest within the large language model policy optimization community.
There were also some metrics tracking spelling-, syntax-, and semantics-level feature match between training and eval prose. The metrics were and remain arbitrary and post-hoc; they adequately confirm the tendency for models to write almost exclusively orthographically and phonetically implausible typos and misspellings shortly after initialization, then undergo a one-way 'phase transition' into nearly exclusively dataset-adherent behavior, both typical and erratic.
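One illustrative (and, as stated, arbitrary) instance of such a metric: the fraction of whitespace-delimited 'words' in a rollout that are attested anywhere in the training text, as a crude spelling-adherence score. The names and regex here are assumptions, not the project's code:

```python
import re

def build_vocab(train_text: str) -> set[str]:
    """Lowercased word forms observed anywhere in the training corpus."""
    return set(re.findall(r"[a-z']+", train_text.lower()))

def spelling_adherence(rollout_text: str, vocab: set[str]) -> float:
    """Fraction of generated 'words' attested in the training text.
    Near 0 for early-training phonetic soup, near 1 after the phase transition."""
    words = re.findall(r"[a-z']+", rollout_text.lower())
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)
```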

Why this mattered: This ordering appeared regardless of evaluation metric. Models 'imitate next-tokens' in the narrow sense for perhaps only *hundreds* of training iterations. Thereafter, almost all loss descent in a training history reflects new sequence-scale behaviors: chunking passages into words and delimited subspans, and composing words and delimiters into sentences. This progression is legible in autoregressive decoding, which comes to match the frequency of sequence-scale as well as token-scale features of the dataset. It is not a confusing secondary property requiring special analysis to detect or trace: it is trivially predicted by (descending) training loss, and is, strictly speaking, the only mechanism by which the loss can descend below that of an optimal bigram-statistics model of the dataset!
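That bigram floor can be made concrete by fitting count-based character-bigram statistics to the dataset and computing their cross-entropy; any model whose eval loss descends below this figure must, by construction, be exploiting context beyond the preceding character. A minimal sketch, not the project's evaluation code:

```python
import math
from collections import Counter, defaultdict

def bigram_cross_entropy(text: str, alpha: float = 1.0) -> float:
    """Average nats-per-character of an add-alpha-smoothed character-bigram model,
    fit and evaluated on the same text (an optimistic floor for that model class)."""
    vocab_size = len(set(text))
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    total_nll, n = 0.0, 0
    for prev, nxt in zip(text, text[1:]):
        ctx = counts[prev]
        p = (ctx[nxt] + alpha) / (sum(ctx.values()) + alpha * vocab_size)
        total_nll -= math.log(p)
        n += 1
    return total_nll / n

# Any trained model with eval loss below bigram_cross_entropy(train_text) is,
# by construction, exploiting structure beyond adjacent-character statistics.
```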

fall 2025:
tune-to-tune transfer transformer

release notes & logs:

What happened: Recorded 5 hours of drum sessions, deliberately capturing all studio audio, whether or not it was musically complete or even related to any sort of final production media.
The goal: create a dataset of nothing but multiple-scale patterns which cannot be 'solved' by a 'reductio ad bigram', i.e. by the embedding and unembedding layers alone. This is strongly motivated by the Transformer Circuits papers.
Trained T5 encoder-decoder models on:
1. Character-level text (for baseline comparison)
2. RVQ-tokenized (Encodec) audio from drum sessions
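A minimal sketch of the second preprocessing path, assuming the facebookresearch `encodec` package and its 24 kHz model (RVQ codes at 75 frames per second); the bandwidth setting and filename are assumptions:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# 24 kHz Encodec: RVQ codes at 75 frames/second; the number of codebooks per frame
# depends on the chosen bandwidth (e.g. 8 codebooks at 6 kbps).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("drum_session_take_003.wav")   # hypothetical filename
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)                       # list of (codes, scale) chunks
codes = torch.cat([c for c, _ in frames], dim=-1)    # shape [1, n_q, T_frames]

n_q, t = codes.shape[1], codes.shape[2]
print(f"{t} frames x {n_q} codebooks = {t * n_q} tokens "
      f"({t / 75.0:.1f} s of audio, {n_q * 75} tokens per second)")
# Even two seconds of 'meaningful silence' is 150 frames x n_q codebook tokens
# that the decoder must reproduce exactly, in order.
```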

What the models learned: Audio representations are a dense and uniform encoding of many 'codebook tokens' which each represent the exact content of approximately 1/75th of a second of waveform. This is the hardest dataset imaginable for token generator models. Meaningful silences can be hundreds of tokens long. The shortest and most off-beat sounding 'drum hit' requires the autoregressive decoding of contiguous spans of dozens of dataset-adherent tones, while also counting out deliberate spans of explicitly represented negative space. So, obviously, the models immediately learned to play drum and bass ghost snare hits, a wide family of distortion effects, and the ADSR envelopes and timbral character of every drum demonstrated.

Why this mattered: Sequence models seem to really like sequences. Whether the input is sequences of bytes originating from text or sequences of placeholders for different 75 Hz codec frames, models appear to spend almost all of their training time after escaping initialization 'noticing sequences' and 'imitating sequences'. Language models learn well-presented patterns of seemingly very great complexity far faster and at far smaller training scales than the early-2020s 'scaling discourse' implied, yet also choke to death on poorly trawled 'internet-scraped garbage' at far earlier training scales and on problems of far more modest apparent complexity.

current work:

Distribution matching loss functions for *encoding artifacts* in dataset preprocessing. Heretofore unseen and perhaps undescribable dataset error measurement and cleaning strategies. Spectral methods for image embedding. Reviews of old papers from 2011 on R^5-space metrics for detecting and preserving spatial features in high-dimensional raw data. Rationalizations and perhaps extensions of the latent consistency modeling auxiliary loss functions to reduce the train-test gap in the famously poorly explained 'diffusion model' denoising task formulation.

Contemporary-architecture deep learning models sure seem to like to learn. Models appear to have no limit to how much they can continue to learn from well-posed tasks and well-represented data. Many interesting behaviors of ML models are almost unobservable because the models learn extremely challenging behaviors and representations almost too quickly to catch them in the process. What intractable problems remain seem to suggest egregious errors in the encoding and presentation of training data, loss functions, sampling functions, or evaluation tools. Some programming, it seems, is required beyond grep & gpt-2.

the unified picture:

Across three modalities and three objectives, the pattern holds:

1. Early training: Decodings change at the level of the overt unit of training (characters/words, spans of exact audio samples, output formats).
2. Mid training: Decodings change to resemble patterns beyond the unit of training (stress patterns, negative space, imagining new justifications to call tool APIs).
3. Later training: Reconciliation of sequence-sized patterns at all scales (dataset-adherent semantics and syntax, reliably playing off-beat, imagining and testing plausible new APIs from context).


This result is bluntly similar for models of similar perplexity/loss, even as data encoding or training objectives vary. This suggests:

- Every good self-supervised training objective is abstracting and generalizing overtly and immediately.
- Tokenization strategies are remarkably similar to each other in strengths and weaknesses; putative differences are overblown.
- Non-policy methods sure look like policy optimization if you observe differences in model outputs-as-sequences during training the way RL practitioners observe model 'rollouts'.
- Concrete dataset composition and the pedantic mechanics of DNN architecture and training matter more than ever. Lower losses are even more indicative of good models than previously believed.

research extensions:

- There is a 'Moravec's paradox' flavor to 'learning to read and write' at all through the mechanism of the token embedding and unembedding units, representing a serious challenge that a LeNet-5-sized model might never overcome. Identifying lower bounds for concrete architectures would be useful here.
- Compact models fit so quickly in the 2025 post-nanogpt-speedrun regime that training a model on a dataset split is a tractable measurement of the 'empirical task difficulty' of that dataset (a sketch follows this list). This suggests several exciting approaches toward unsupervised or self-supervised curriculum learning of tasks or capabilities.
- Measuring abstraction and generalization 'hierarchies' or 'developmental sequences' is now tractable, and possible in various inductive-bias-free flavors befitting descriptivist attitudes toward cognition.
- Policy optimization methods are axiomatically cursed by the definition of 'relative advantage within sampleable behaviors'. The unsupervised / self-supervised curriculum-modeling strategies implied here might be necessary to extend the maximum 'developmental trajectory' a policy-gradient fine-tune can navigate before it reaches saturated reward, or distributions of behaviors whose advantage has extremely low signal-to-noise ratio.
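A minimal sketch of the 'empirical task difficulty' probe mentioned above; `train_compact_model` is a hypothetical stand-in for whatever speedrun-style compact training loop is already on hand:

```python
def train_compact_model(train_text: str, steps: int = 4500) -> float:
    """Hypothetical stand-in: train a small char-level model on `train_text` for a
    fixed step budget and return its final eval loss (nats/char)."""
    raise NotImplementedError("plug in your compact training loop here")

def empirical_difficulty(splits: dict[str, str], steps: int = 4500) -> dict[str, float]:
    """Score each dataset split by the loss a compact model reaches under a fixed budget.
    Higher residual loss means an empirically harder split; the ranking can order a curriculum."""
    return {name: train_compact_model(text, steps) for name, text in splits.items()}

# e.g. difficulty = empirical_difficulty({"dialogue": dialogue_text, "recipes": recipe_text})
#      curriculum = sorted(difficulty, key=difficulty.get)   # easiest split first
```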

contact & collaboration:

This work is open (you can copy it) and resource-limited (scale is limited to wide and exploratory dis/confirmations).
If you're interested in:

- Replicating these experiments
- Formalizing sequence analysis of sequence models
- Scaling anything demonstrated here
- General purpose intrigue

Reach out via email or Twitter (@sameQCU).