so what's the big deal with RL?
tl;dr:
prime intellect is pretty cool. they rent me spot gpus and coincidentally distribute the *most reproducible* reinforcement learning trainer. the clickbaitiest contemporary reinforcement learning training reports are way less surprising than training policy models yourself. policy model output samples and reward scores are super interpretable and legible! very cool era of machine learning, we should all have fun training machine learning models together.

week 1:
it might be surprising to hear, but sometimes it is possible to outline what a computer program ought to do before you run it for the first time. the gap between 'ought' and 'is' can be cavernous, of course, and grows wider in the discourse of computer programming than perhaps any other. suffice to say, we measure 'oughtas' by our lack of surprise when an image encoder spits out a compressed file whose corners are blurry, and by our presence of surprise when clicking on an email locks our computer & threatens to delete our files unless we beat a magical girl shooter game.

so with that in mind, it's not unusual in the tradition of the analysis and interpretation of computer programs to 'register a prediction', as the web-bayesians put it, on what we expect a machine learning program to do. the first exercise here was a contrastive reading of different source texts (read: every policy modeling paper used in a real ML model) claiming to define 'policy'. it turns out that 'policy' in the 'ai theory' of the old, dying, and AGI-less era means one element of a wasteful triplication of a computer program into 'policy', 'value', and 'reward' programs, done only for the purpose of a slow, neurosymbolic, and triply intractable 'search over other computer programs', with no special guarantees of efficiency or stability.
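(for concreteness, here is that triplication reduced to its smallest working example, a gradient bandit. this is the textbook decomposition, not any particular codebase's:)

```python
import math, random

# the triplication in miniature, on a 2-armed bandit:
def reward(action: int) -> float:             # the 'reward' program: what a step pays out
    return 1.0 if action == 1 else 0.2

prefs = [0.0, 0.0]                             # the 'policy' program: softmax preferences
baseline = 0.0                                 # the 'value' program: running average reward

def policy_probs():
    exps = [math.exp(h) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

# the 'search over other computer programs': the gradient-bandit update,
# about the smallest honest policy gradient there is
for t in range(1, 501):
    probs = policy_probs()
    action = random.choices([0, 1], weights=probs)[0]
    r = reward(action)
    baseline += (r - baseline) / t
    for a in (0, 1):
        grad = (1.0 - probs[a]) if a == action else -probs[a]
        prefs[a] += 0.1 * (r - baseline) * grad

print(policy_probs())  # arm 1 ends up strongly preferred
```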
anyways, it took two days to do all of the related theory work and another five to try to build vLLM and all of the linked dependencies corresponding to a reference paper: One-Shot-RLVR. dependency graphs make fools of us all. the code release of the One-Shot-RLVR paper pins combinations of packages which are mutually incompatible, so strictly duplicating the experiment design (codebase) leaves some combination of training, logging, evaluation, and RL-specific runtime acceleration broken and blocks the research code from training any further models.
week 2:
after some urgent code review, the entire codebase of the One-Shot-RLVR research team is thrown out in search of *any* reproducible RL training methodology. there's a little bit of experiment design work and some ominous foreshadowing of the fundamental challenge of string manipulation in machine learning. it appears that the 'idea of verification' in machine learning depends on extracting some signal (e.g. the answer to a math problem) which can be judged or evaluated more robustly and certainly than 'the entire output'. in machine learning practice, this means giving 'policy models' absolutely no 'training reward' unless they answer problems correctly and also within special answer-submission fields legible to the training code.

put simply, 1+1 != 2. 1+1 only equals <answer>2</answer> (and might also equal <answer> 2</answer> if the training protocol designer is extremely generous).
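to make that concrete, here's a minimal sketch of that kind of binary 'verifiable reward' check. the <answer> tag convention matches the prose above, but the function itself is illustrative, not the exact verifiers API:

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """1.0 only if the completion has a parseable <answer> field whose contents
    match the gold answer; otherwise 0.0, no matter how right the math was."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no legible answer-submission field -> no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

assert verifiable_reward("the answer is 2", "2") == 0.0    # 1+1 != 2
assert verifiable_reward("<answer>2</answer>", "2") == 1.0  # 1+1 == <answer>2</answer>
```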
so what happens when training with verifiable reward signals on the reproducible training methodologies? weird stuff, apparently. the earliest trial model reused willcb's multi-turn tool-use RL training template, which introduces multiple auxiliary reward components, particularly for the correct use of training-program-parseable output fields. curiously, these auxiliary reward components appear to work *too well*, and the resulting policy model learns to write an incantation of

{"name":"python", "args":{ "code":"import numpy as np; print(np.array([1,2,3])+np.array([4,5,6]))"}}

in response to all possible math problem training inputs.
the model developed a strategy using every important buzzword that gives 'reward' (relative to other chains of words it could write):
1: use 'tool syntax' correctly (+reward) to...
2: execute a 'tool call' (++reward)...
3: using semantically correct (++reward)...
4: python (+reward)...
5: as many times as possible in a multiturn training sequence (+++reward).
intriguingly, if the reward weighting for the training protocol *insists* that using python 5 times is 5 times more important than submitting a correct answer, it is an RL training *error* for the policy model to submit correct answers (1 point?) instead of using python 5 times (5 points!)
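a back-of-the-envelope version of that arithmetic, with made-up weights (the real template's reward components and magnitudes will differ):

```python
# hypothetical weights, loosely in the spirit of a multi-turn tool-use rubric
TOOL_CALL_REWARD = 1.0       # per syntactically valid, successfully executed tool call
CORRECT_ANSWER_REWARD = 1.0  # one-time bonus for a verified final answer

def episode_reward(valid_tool_calls: int, answered_correctly: bool) -> float:
    return TOOL_CALL_REWARD * valid_tool_calls + CORRECT_ANSWER_REWARD * answered_correctly

print(episode_reward(5, False))  # 5.0 -- spam numpy five times, never answer
print(episode_reward(0, True))   # 1.0 -- just answer the math problem correctly
# under these weights, the gradient points firmly at the numpy incantation.
```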
week 3:
regardless, the successfully trained models in week 2 (available on huggingface) were trained against a normal 'math verification' training set of multiple problems, each with their own solutions. perhaps the ordering of which problems were trained when was in some sense responsible for the model's behavior at test time! curriculum-learning-by-accident would come as no particular surprise here (a sketch of pinning that ordering down follows below).

furthermore, the research agenda motivating this work was 'find out what policy gradients do to language models', not 'several uncontrolled dataset ordering studies on import-numpy-as-np modeling in a multi-turn setting'. also, willcb of primeintellect&verifiers just patched their reinforcement learning suite with all sorts of changes to RL environment scripting, RL scoring scripting, and some other stuff.
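about that ordering hypothesis: the cheap way to make it testable is to turn the permutation into a logged experimental variable instead of an accident. a toy sketch with the huggingface datasets API (the columns and seed are placeholders, not the actual training set):

```python
from datasets import Dataset

# stand-in for the 'math verification' training set: problems paired with solutions
problems = Dataset.from_dict({
    "problem":  ["1+1", "2*3", "10-7"],
    "solution": ["2", "6", "3"],
})

# fix the presentation order with an explicit seed and record the permutation,
# so two runs that differ only in ordering can actually be compared
problems = problems.shuffle(seed=0)
print(problems["problem"])  # log this alongside the run config
```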
week 3 of the reinforcement learning hackathon was then spent writing dataset retemplating code. discovering that the qwen2.5 and qwen3 chat templates have critical-runtime-error-inducing text parsing errors... which can be fuzzed accidentally by policy optimization. discovering that the huggingface math-verify tool (suggested by many ML resources, including verifiers) for parsing formally equivalent math formulas will blow up pareto-frontier ml trainers which use multiprocessing. (aside: modern servers have hundred*s* of cpus. many machine learning researchers still ship breakingly singlethreaded code. what a curious world in which we live.)
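one defensive pattern, whatever exactly misbehaves inside the parser, is to quarantine each scoring call in its own short-lived process with a hard timeout. a hedged sketch: parse and verify are math-verify's documented entry points, but the wrapper is my own assumption, not something verifiers ships:

```python
import multiprocessing as mp

def _verify_in_child(gold_str, pred_str, out):
    # import inside the worker so the parser only ever runs in a process it fully owns
    from math_verify import parse, verify
    out.put(bool(verify(parse(gold_str), parse(pred_str))))

def safe_math_verify(gold_str, pred_str, timeout_s=5.0):
    """score one answer in a throwaway process, so a hung or crashing parse
    can't take the (already multiprocess) trainer down with it."""
    out = mp.Queue()
    child = mp.Process(target=_verify_in_child, args=(gold_str, pred_str, out))
    child.start()
    child.join(timeout_s)
    if child.is_alive():       # parse/verify hung: kill it and score the answer as wrong
        child.terminate()
        child.join()
        return False
    return out.get() if not out.empty() else False

if __name__ == "__main__":
    # expected True if math-verify treats these as formally equivalent
    print(safe_math_verify("$\\frac{1}{2}$", "0.5"))
```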