hire_me.html
ABSTRACT
Been running experiments on learning dynamics across modalities (char-level LMs, audio T5, math RL). Core finding: models abstract and generalize from the very start of training; observable stage progressions from token statistics to syntax to semantics happen early, and are predictable from loss curves alone. Suggests data quality and architecture matter more, and data representation less, than commonly weighted. Code/models public.
PROSPECTUS

Reinforcement Learning: Small Models, Complex Behaviors

Applied GRPO to Qwen-2.5/3.0 models (0.6B, 1.5B, 7B) on verified math benchmarks. Large models saturated rewards with expected behaviors. Small 0.6B models, right at the threshold of capability, discovered that the reward structure made Python tool calls (5 points each, up to 5 uses) far easier to collect than correct answers (1 point, plus 5 for tool use). They optimized the actual, more complex objective, not the apparent one. At evaluation time these models wrote in aberrant neuralese, with degraded language use relative to their base models, but treated new contexts as opportunities to query hypothetical APIs never seen in training: a striking success compared to the n-gram looping or random character emission documented in "neural text degeneration." This motivated logging rollouts to investigate what models actually learn during training, an approach carried forward into the character-level work.
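
A minimal sketch of that reward shape, reconstructed from the numbers above; the function and the tool-call tag are illustrative assumptions, not the exact rubric in github.com/SQCU/verifiers:

    # Illustrative reconstruction of the reward shaping described above.
    # Each Python tool call is worth 5 points (capped at 5 calls); a verified
    # correct answer adds 1 point on top of the tool bonus. A 0.6B model can
    # max the 25-point tool term without ever producing a correct answer.
    def shaped_reward(completion: str, extracted_answer: str, gold: str) -> float:
        """Score one rollout under the assumed rubric."""
        tool_calls = completion.count("<tool_call>")   # hypothetical tag format
        tool_bonus = 5.0 * min(tool_calls, 5)
        correctness = 1.0 if extracted_answer.strip() == gold.strip() else 0.0
        return correctness + tool_bonus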

Character-Level Models and T5 Importance Sampling

Built training infrastructure with online rollout capture to generate and evaluate samples during training without checkpoint persistence. Character-level GPT on TinyStories hit 2.13 perplexity in 4,500 steps, with observable stages: noise → grammatical nonsense words (0-300 steps) → plausible misspellings (300-600) → dataset-appropriate sentences that introduce proper nouns and objects and relate them across sentences (600+). Capturing batches of rollouts alongside mid-training eval revealed that uniform T5 masking ratios create an inductive bias: decoders learn to write in triplets rather than natural structure. Implemented adaptive importance sampling for T5's masking schedule (analogous to diffusion noise schedules), which brought char-level T5 models to autoregressive parity immediately; a sketch follows below. This approach required no architecture changes for multimodal extension, only lengthening the embedding table to cover audio tokens.
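
A minimal sketch of what that importance sampler can look like: a binned, loss-proportional proposal over mask ratios with an inverse-density correction on the loss. Bin count, ratio range, and EMA decay here are assumptions for illustration, not the defaults in github.com/SQCU/attn_demo.

    import numpy as np

    class MaskRatioSampler:
        """Adaptive importance sampler over T5 span-corruption mask ratios."""
        def __init__(self, n_bins: int = 32, lo: float = 0.05, hi: float = 0.5,
                     decay: float = 0.99):
            self.edges = np.linspace(lo, hi, n_bins + 1)
            self.loss_ema = np.ones(n_bins)   # start uniform: every bin equally "hard"
            self.decay = decay

        def sample(self, rng: np.random.Generator) -> tuple[float, float]:
            """Draw a mask ratio from the loss-proportional proposal.
            Returns (ratio, weight); the weight corrects the training loss
            back to a uniform expectation over ratios."""
            probs = self.loss_ema / self.loss_ema.sum()
            b = rng.choice(len(probs), p=probs)
            ratio = float(rng.uniform(self.edges[b], self.edges[b + 1]))
            weight = (1.0 / len(probs)) / probs[b]
            return ratio, weight

        def update(self, ratio: float, loss: float) -> None:
            """Fold an observed batch loss back into the matching bin's EMA."""
            b = int(np.clip(np.searchsorted(self.edges, ratio) - 1, 0, len(self.loss_ema) - 1))
            self.loss_ema[b] = self.decay * self.loss_ema[b] + (1 - self.decay) * loss

Each training step draws (ratio, weight), corrupts the batch at that ratio, scales the batch loss by weight before backprop, and feeds the observed loss back through update().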

Audio T5: Compositional Generalization at Scale

Trained a T5 encoder-decoder on 5 hours of RVQ-tokenized drum recordings (75Hz, ~1/75th of a second per token). This is an unusually punishing token generation task: silences span hundreds of tokens, and the shortest drum hits require dozens of contiguous tokens. Models learned ghost snares, distortion effects, ADSR envelopes, and timbral characteristics. The importance sampler from the char-level work transferred directly. Wrote an "infill sampler" leveraging T5's bidirectional encoder: split the input in half, encode both halves, decode middle tokens that continue from the first half and build toward the second. This sampling scheme, extremely difficult to prompt-engineer or RL-tune on non-T5 models, worked immediately on checkpoints originally tested with autoregressive decoding. Generated samples are limited by audio tokenizer reconstruction error, not by generator-side failures, with self-consistent interpolations spanning very long token counts.
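
A minimal sketch of that infill sampler, written against Hugging Face's generic T5 generation interface; the sentinel handling and lengths here are assumptions, and the actual measureformer sampler may differ in detail:

    import torch
    from transformers import T5ForConditionalGeneration

    @torch.no_grad()
    def infill(model: T5ForConditionalGeneration, tokens: torch.Tensor,
               sentinel_id: int, max_new_tokens: int = 512) -> torch.Tensor:
        """Split a 1-D tensor of token ids (RVQ audio or characters) in half,
        encode both halves with a sentinel marking the gap, and sample the
        bridging middle span from the decoder."""
        mid = tokens.shape[0] // 2
        left, right = tokens[:mid], tokens[mid:]
        encoder_input = torch.cat([left, torch.tensor([sentinel_id]), right]).unsqueeze(0)
        return model.generate(
            input_ids=encoder_input,
            do_sample=True,
            temperature=1.0,   # no top-k / nucleus filtering (see below)
            top_k=0,
            top_p=1.0,
            max_new_tokens=max_new_tokens,
        )[0]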

Unified Finding: Patterns Trump Representation

All models learn patterns at all scales present in the training data. Character-level, BPE, and 75Hz audio tokens all revealed similar sequence-oriented learning when matched on perplexity and training objective. Architecture optimization and data quality dominate over data representation choices or topic-related inductive biases. All models studied used temperature 1.0 sampling with no top-k or nucleus filtering: low-error models produce interpretable rollouts without greedy decoding, and directly improving task loss improves output quality immediately and visibly. Observing model outputs as sequences during training shows that token-level objectives improve the trajectories of sampled outputs throughout pretraining, without a sharp qualitative difference from explicit policy objectives defined over trajectories.
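
For reference, the decoding used throughout amounts to the following (a minimal sketch, not the exact sampling loop in the released code):

    import torch

    def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        """logits: (batch, vocab) raw next-token scores. Sample the full softmax
        at temperature 1.0; nothing is truncated or re-ranked."""
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)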

Resources: character-level models at huggingface:SQCU/pgptlformer-tinystories, audio models at huggingface:SQCU/measureformer, training code at github.com/SQCU/attn_demo, RL experiments at huggingface:SQCU/pycall-grpo_qwen3-0.6b-base and github.com/SQCU/verifiers. Detailed research history at sqcu.dev/texts/items/fall_of_ml2025.html; model notes and commit logs at sqcu.dev/texts/items/release_notes_q32025.html. Contact: @sameQCU. All work is open-source; interested in replication, collaboration, and scaling studies.