release notes
fall 2025
sike the release is attn_demo 3 times in a row
char-level lm:
motivation:
"so anyways. i want to try out the liu,gan,tegmark BIMT loss. but i also want to try out the 'confabulated memory of the l,g,t,B loss'""aight uhhh it's time to uhhhh do uhhh. 'dry run' character level transformer without doing anything tricky."
It's best to set the Liu, Gan, Tegmark BIMT loss aside for now. The LGTB loss can be considered a sparsity-motivated regularization which, unlike L2 regularization, maximizes the quantity (and perhaps the contiguous volume) of 'FFN' or 'Dense' network connections set to '0'. The immediate character-level-sequence language model training was posed as a testing tool for identifying sparsity-induced model collapse, which might be subtler in retrained webtext checkpoints: such models might 'lose capabilities' related to 'low frequency tokens' before showing obvious degradations on eval tasks.
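for contrast with L2, here is a minimal python sketch of a distance-weighted sparsity penalty in the BIMT spirit. the 1D neuron layout and the penalty coefficient are simplifying assumptions, not the liu, gan, tegmark recipe verbatim (the paper embeds neurons in 2D and periodically swaps them):

import torch
import torch.nn as nn

def bimt_style_penalty(linear: nn.Linear, coeff: float = 1e-4) -> torch.Tensor:
    # distance-weighted L1 penalty in the spirit of liu, gan, tegmark (BIMT).
    # neurons are laid out on a 1D line (a simplification); each weight pays
    # |w_ij| * d_ij, so long-range connections are pushed to exactly zero first.
    # compare L2, which shrinks every weight but rarely zeroes any of them.
    out_pos = torch.linspace(0, 1, linear.out_features).unsqueeze(1)   # (out, 1)
    in_pos = torch.linspace(0, 1, linear.in_features).unsqueeze(0)     # (1, in)
    dist = (out_pos - in_pos).abs()                                    # (out, in)
    return coeff * (linear.weight.abs() * dist).sum()

# usage: total_loss = task_loss + sum(bimt_style_penalty(m)
#                                     for m in model.modules() if isinstance(m, nn.Linear))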
results:
{paragraph sized div goes here. remark on package of plaintext rollouts packaged with model. remark that rollout statistics are retrospective, over a large collection of rollouts packaged with the trained checkpoints. worth revisiting with a better semantic distance measurement at some future time.}
commits:
{paragraph sized div elaborating analyze_rollouts.py. in particular, analyze_rollouts.py is a decoupled secondary process that nibbles at the contents of the per-training-run parquets written by the training process; it is decoupled because i was too change-averse and risk-minimizing to implement an in-memory database service as the i/o bottleneck for an already-robust write-to-file minimal-include trainer.}
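a minimal python sketch of the decoupled-analysis pattern: a second process reads the per-run parquets after the fact. the glob pattern and the column names ('rollout_text', 'step') are placeholders, not the attn_demo schema:

import glob
import pandas as pd

def summarize_rollouts(run_dir: str) -> pd.DataFrame:
    # gather every parquet the trainer wrote for this run, then compute
    # retrospective statistics over the whole collection, grouped by step.
    frames = [pd.read_parquet(p) for p in sorted(glob.glob(f"{run_dir}/*.parquet"))]
    df = pd.concat(frames, ignore_index=True)
    df["rollout_len"] = df["rollout_text"].str.len()
    return df.groupby("step")["rollout_len"].agg(["count", "mean", "std"])

# usage: print(summarize_rollouts("runs/char_lm_sep04"))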
Commits on Sep 5, 2025
"oops i packaged the peek log. that was an error, but like, i guess its in your history now hahahaha"
https://github.com/SQCU/attn_demo/commit/88cfe04d26058ad81de9d6b5ee7c8e06ab0dd72b:
"few more deps for eval. eval runner digests captured rollouts from yesterday's patch, still uses a wack-ah-ah edit distance matrix which must be revised. plots and the implications about the underlying data are solid reference code."
Commits on Sep 4, 2025
"you can capture rollouts while training if you want. but if you do you'll cause cuda graph breaks and everything will take like 160% as long to train. isn't that cool! isn't python-torch-triton a great language???"
"dependencies supporting rudimentary sub-word error measures, sentence-level syntactical similarity, sentence-embedding semantic similarity."
"wow i was eepy last commit okay this should be everything needed to run a character-level-transformers train"
"gitignore set wrong??? dataset processing scripts should show up now"
Commits on Sep 3, 2025
"hm."
"character-level-sequence modeling studies"
char-t5 denoising:
motivation:
"another way to put this is that an encoder-decoder model is like a LLM that costs half as much to run, per parameter, as a measureGPT with the same memory budget.""guys i found the not-so-hidden but surprisingly subtle inductive bias in the t5 objective"
{paragraph sized div goes here, focus on extreme similarity between tinystories-config models and the original t5 paper's BERT_BASE configuration. emphasize this is a forwards porting of the t5 paper's methods and scaling properties motivated by the paligemma, PaLM, and nanogpt-speedrun architecture optimization studies.}
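for reference, the BERT_BASE-scale shape that the original t5 paper's 'Base' encoder and decoder were each designed to match (Raffel et al. 2020); whether the tinystories-config models reproduce these numbers exactly is left to the configs in this repo:

# python, purely illustrative
BERT_BASE_LIKE = dict(
    num_layers=12,   # per stack: 12 encoder blocks + 12 decoder blocks in t5-base
    d_model=768,
    num_heads=12,
    d_ff=3072,
)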
results:
"now THAT'S a character-level encoder-decoder model (lmao):Once upon a time, there was a biken did not want to trepend.
Plep.
Sue heavy. The looked around next scaper. It named Dad neveror. He sub. Sue wing.
The moot. Ent marmany dars. Bopens.
Tom jamors. Sue wam.
Tom Sue."
"wahahaha it is working!
the t5 objective model is now writing grammatically and syntactically structured outputs which are far more reminiscent of the 'LLM objective' despite using the encoder-decoder stack!"
{paragraph sized div goes here, establish contrast demonstrated in links: uniform t5 task mask ratio sampling produces decoders who try to write outputs in triplets. adaptive importance sampling is derived and implemented as a hurried aside, and brings t5 models to parity for 'free' with a one-shot 'yolo' commit.}
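for orientation, a minimal python sketch of the t5 span-corruption objective whose mask-ratio sampling is at issue here. the sentinel-id convention and the span-length sampling are assumptions, not the repo's masking code:

import random

def span_corrupt(tokens, mask_ratio=0.15, mean_span=3, sentinel_start=32000):
    # replace random contiguous spans with sentinel tokens in the encoder input
    # and emit the dropped spans, each prefixed by its sentinel, as the decoder
    # target. mask_ratio and mean_span are exactly the knobs whose sampling
    # distribution (uniform vs adaptive) is being contrasted above.
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    enc_in, dec_tgt, i, sentinel = [], [], 0, sentinel_start
    while i < len(tokens):
        if n_to_mask > 0 and random.random() < mask_ratio:
            span = max(1, min(n_to_mask, int(random.expovariate(1 / mean_span)) + 1))
            enc_in.append(sentinel)
            dec_tgt.append(sentinel)
            dec_tgt.extend(tokens[i:i + span])
            i += span
            n_to_mask -= span
            sentinel += 1
        else:
            enc_in.append(tokens[i])
            i += 1
    return enc_in, dec_tgt

# usage: enc, dec = span_corrupt(list(range(50)))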
commits:
{paragraph sized div goes here.}
...adaptive importance sampling for the masking schedule, directly analogous to adaptive noise schedules in diffusion models seen in kingma, gao 2023 (arXiv:2303.00848), appendix F...
appendix F (diffusion):   p(λ) ∝ E[w(λ)·||ε - ε̂_θ(z_λ; λ)||²]
naive analog here:        p(λ) ∝ loss(λ)
...the lagrange multiplier formulation is gentler, and distribution shift is limited for smoothness...
ergo, solve:              min_q KL(q || p_base)  subject to  E_q[loss] = target
solution:                 q ∝ p_base · exp(-λ·loss),  where λ satisfies the constraint
bucketed form:            p(bucket) ∝ exp(-λ · loss[bucket]) · base_distribution[bucket]
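a minimal numpy sketch of the closed-form solve above, bisecting on the multiplier λ until the expected bucket loss hits the target. the bucket losses, base distribution, target value, and bisection bounds are placeholder assumptions, not the repo's constraint solver:

import numpy as np

def solve_masking_distribution(loss, p_base, target, lo=-50.0, hi=50.0, iters=60):
    # solve min_q KL(q || p_base) s.t. E_q[loss] = target via its closed form
    # q ∝ p_base * exp(-lam * loss), bisecting on the multiplier lam.
    # p_base must be strictly positive.
    def q_of(lam):
        logits = np.log(p_base) - lam * loss
        logits -= logits.max()              # numerical stability
        q = np.exp(logits)
        return q / q.sum()

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # E_q[loss] is monotone decreasing in lam: overshoot means lam must grow
        if (q_of(mid) * loss).sum() > target:
            lo = mid
        else:
            hi = mid
    return q_of(0.5 * (lo + hi))

# placeholder numbers: asking for E_q[loss] above the uniform average (1.075)
# upweights the high-loss mask-ratio buckets relative to the base distribution.
loss = np.array([2.0, 1.2, 0.7, 0.4])
p_base = np.ones(4) / 4
q = solve_masking_distribution(loss, p_base, target=1.3)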
Commits on Sep 15, 2025
"i forgot which hyperparameter i reverted but like dw about it."
github:sqcu/attn_demo:
20e7fa8b1569fd01552a9ce0862948a70f614cd1
"okay so t5 training now uses a curriculum learning algoslop constraint solver which, like, uh, does some stuff to like, well, it should make everything work better and also incude multiscale grammatical structures and pseudowords in t5 models that otherwise write very short fragmentary n-grams"
"... okay fixed special token masking this trains some wack-ass t5-architecture models"
Commits on Sep 14, 2025
github:sqcu/attn_demo:
c930794736913df5a2bf3868850e15638b079073
"WOOOO T5 TRAINING! buggy training that is; the naive slopjobs in a few points in the model's t5 loss function are masking out pad tokens but not 'sentinel tokens' i mean uh 'masking tokens' i mean uh 'special tokens'. however it trains so this is the first t5 commit. WOOO!"
"dataset processing code (preliminary) to embed audio as encodec embeddings and perhaps train 'audiogpts'"
measureformer:
motivation:
"at this point im genuinely thinking itll be easier to make a bespoke transformer to do rhythm pattern imitation / interpolation than to find a sequencer which has good ux on a laptop without hundreds of dollars of human interface device peripherals""The shoegaze items"
...
"I should now have a suitably rich raw recording for generative ml"
{paragraph sized div goes here}
results:
"it's the 'malicious entity' from the famous science fiction discourse 'why you should *not* build the sonically malicious entity'.""found some surprisingly complex drum fills and decisions for how to react to out of distribution samples drawn from music unrelated to the t5 audio project."
{paragraph sized div goes here, detailing the 75hz tokenization and how literally any noise that sounds like a drum or a timbrally rich delay pedal effect is a miracle & represents dozens of dataset-adherent, rather than random, contiguous spans of decoded tokens}
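for reference, a minimal python sketch of the 75hz tokenization via the transformers EncodecModel (the 24 kHz encodec checkpoint emits 75 codebook frames per second of audio); the repo's actual dataset processing script may differ:

import torch
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

def audio_to_tokens(waveform):
    # encode a mono waveform (1D float array at 24 kHz) into discrete encodec codes.
    inputs = processor(raw_audio=waveform, sampling_rate=processor.sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
    # audio_codes shape: (chunks, batch, num_codebooks, frames), ~75 frames per second
    return encoded.audio_codes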
commits:
{paragraph sized div goes here, remarking on implementation of t5 sampling as a surprise last-minute design problem of similar complexity to the implementation of t5 training. there's also deployment infrastructure to consider: cpu decoding from BERT_BASE-sized t5 models is too slow for a decent person, so sampling code is presented for ease of renting industry-standard spot gpu VMs}
Commits on Sep 24, 2025
github:sqcu/attn_demo:
868d946c60f83cfd84775a5f3c5cfdf95a75ab3a
"..."
"just btw this irritating pyproject.toml ACTUALLY BUILDS on remote nodes and also on game computers with cards. a world altering achievement if it was more than a side effect of getting the inference server working. updated readme with remote spot instance inference server documentation."
"redis remote prototype"
Commits on Sep 21, 2025
... dependencies...
Commits on Sep 17, 2025
github:sqcu/attn_demo:
996793722517796a71e5dd3b6a5d0127eddd5bb6
t5 unmasking-objective 'infill' sampling: provide some audio sample, get a chain of reactions and variations back (data dependent). the github default ux surface, the 'readme', now mentions more of what the repository does.
so basically: use sample_audio_ty.py, and use loader.py on configs/mformer-t5-mk1.json. skipdecode is a janky mess right now; i do not recommend any part of that chain of scripts yet.
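for orientation, a minimal python sketch of the infill-sampling loop shape: the encoder sees the masked (sentinel-bearing) sequence, the decoder writes each span autoregressively. the seq2seq forward interface and the token-id arguments are assumptions, not sample_audio_ty.py's actual API:

import torch

@torch.no_grad()
def t5_infill_sample(model, enc_ids, bos_id, eos_id, max_new=256, temperature=1.0):
    # 'model' is assumed to expose forward(input_ids, decoder_input_ids) -> logits.
    # enc_ids already contains sentinel tokens at the positions to infill; the
    # decoder emits, for each sentinel, that span's reconstructed content.
    dec = torch.tensor([[bos_id]], dtype=torch.long, device=enc_ids.device)
    for _ in range(max_new):
        logits = model(input_ids=enc_ids, decoder_input_ids=dec)[:, -1, :]
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)
        dec = torch.cat([dec, nxt], dim=-1)
        if nxt.item() == eos_id:
            break
    return dec[0, 1:]   # the sampled spans, delimited by their sentinel ids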