Multi-backend implementations of the 30 papers from the Sutskever reading list, using the canonical NumPy repo as the baseline reference and specification.
Reference baseline:
This repository keeps the original paper numbering for compatibility, but tracks a separate `build_order` so implementation work can follow dependencies rather than the original list order.
Building on the NumPy-version-only baseline, this repo adds:
- Python project metadata: `pyproject.toml`
- Demo runner: `scripts/run_paper.py`
- Verification telemetry generator: `scripts/generate_verification_status.py`
- tinygrad installer helper: `scripts/install_tinygrad.sh`
- Agda library file: `sutskever-30-beyond-numpy.agda-lib`
- Agda setup notes: `docs/AGDA_SETUP.md`
Shared infrastructure:
- `shared/fixtures` for future shared fixtures; still reserved unless real fixture reuse emerges
- `shared/tests` for repo-wide verification harnesses and cross-paper test modules
- `shared/notes` for shared note schemas and verification vocabulary
The current design keeps paper claims, invariants, proofs, and most fixtures paper-local, but the shared verification methods and shared note schemas now live under `shared/`.
Per paper, the expected implementation pipeline is:
`spec.md` → `numpy_checks.py` → `sympy/` → `tinygrad/` → `torch/` → `jax/` → `cubical-agda/`
Rules:
- `NumPy` is minimal and exists only for sanity checks, fixtures, baseline comparisons, and executable pseudocode.
- `SymPy` is always present, even if the note says symbolic treatment is mostly ceremonial for that paper.
- `tinygrad` is always present, even if the note says the backend is mostly ceremonial for that paper.
- `tinygrad` is the first minimal executable autodiff backend.
- This repo defaults tinygrad to the `LLVM` backend inside the tinygrad-specific code paths when no tinygrad backend env var is already set.
- `PyTorch` is the primary executable training implementation.
- `JAX` is the second executable implementation and a cross-check on functional structure.
- `Cubical Agda` is always present, even if the note says the formalization is intentionally thin.
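The LLVM-default policy above can be sketched as follows. This is a hypothetical helper, not the repo's actual code; it assumes tinygrad's convention of selecting a backend via environment variables such as `LLVM=1` or `CPU=1`:

```python
import os

# Backend env vars tinygrad conventionally recognizes (assumption; the real
# set depends on the installed tinygrad version).
KNOWN_BACKENDS = ("LLVM", "CPU", "CUDA", "METAL", "GPU", "CLANG")

def ensure_tinygrad_backend(default: str = "LLVM") -> str:
    """Default tinygrad to LLVM unless a backend env var is already set.

    Illustrative sketch of the policy described above, not the repo's code.
    Returns the backend that ends up selected.
    """
    for name in KNOWN_BACKENDS:
        if os.environ.get(name) == "1":
            return name  # user already chose a backend; respect it
    os.environ[default] = "1"  # opt into LLVM only when nothing else is set
    return default
```

Calling this at the top of a tinygrad-specific code path keeps the choice local to those paths rather than mutating global config elsewhere.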
Agda status note:
- `make agda-check` means the Cubical Agda layer typechecks successfully.
- It does not mean the whole paper is formally verified.
- In many papers here, the Agda layer is intentionally minimal and captures only a formal core, interface, or invariant slice.
Verification telemetry note:
- this repo exposes a generated `verification.yaml` as an observability artifact, not as a correctness certificate
- the point is to record what was present, executed, typechecked, gradient-checked at a thin level, and last refreshed
- the point is not to claim full reproduction or full formal verification
- for theory-heavy papers, the notes and telemetry can also record proxy scope and claim coverage rather than pretending the executable toy fully captures the paper
This repository is not trying to collect random backend ports. The point of the stack is that each layer answers a different question about the same paper.
NumPy -> SymPy -> tinygrad -> PyTorch -> JAX -> Cubical Agda
Read it like this:
- NumPy: what is the numerical object?
- SymPy: what is the symbolic formula?
- tinygrad: what is the smallest real autodiff implementation of that formula?
- PyTorch: what is the practical, production-grade implementation?
- JAX: what does the same system look like in a functional transformation-oriented style?
- Cubical Agda: what can be stated and checked at the level of types, invariants, and proofs?
That ordering is deliberate. It moves from direct manipulation, to derivation, to minimal autodiff, to industrial tooling, to functional cross-checking, to formal structure.
NumPy is the foundation because it leaves very little hidden.
- Arrays, linear algebra, and eager numerical execution are explicit.
- There is no automatic differentiation engine to hide mistakes.
- If a paper uses a recurrence, an attention score, a KL term, or a convolution, a NumPy implementation forces the repository to say exactly what that object is in ordinary numerical terms.
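As a concrete instance of that forcing function, here is a minimal scaled dot-product attention score written in plain NumPy. This is an illustrative sketch, not code from the repo:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention in plain NumPy (illustrative sketch).

    Every step is explicit: the raw score matrix, the softmax
    normalization, and the weighted sum over values. There is no
    autodiff engine or framework abstraction to hide mistakes.
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (n_q, n_k) raw scores
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
out = attention(Q, K, V)
```

With a zero query, the scores are uniform and the output reduces to the mean of the values, which is the kind of edge-case sanity check the NumPy layer exists to make cheap.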
SymPy sits directly above that because it answers a different question from NumPy.
- NumPy tells you the value for an input.
- SymPy tells you the formula, the derivative, the simplification, or the identity behind that value.
- This is the layer where the repo can justify gradient formulas, ELBO algebra, gate equations, receptive-field arithmetic, and attention-score derivations without hiding behind code alone.
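A tiny example of the kind of justification this layer provides: deriving and verifying the logistic-gate gradient identity used throughout LSTM gate equations, symbolically rather than numerically. This is an illustrative sketch, not code from the repo:

```python
import sympy as sp

x = sp.symbols("x")
sigma = 1 / (1 + sp.exp(-x))        # logistic function used in gate equations
dsigma = sp.diff(sigma, x)          # symbolic derivative

# Verify the classic identity sigma'(x) = sigma(x) * (1 - sigma(x))
identity_gap = sp.simplify(dsigma - sigma * (1 - sigma))
assert identity_gap == 0
```

The same pattern extends to KL terms in an ELBO or attention-score derivatives: state the formula, differentiate, and let `simplify` confirm the hand-derived identity.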
tinygrad comes next because it is the first genuinely executable autodiff backend that is still small enough to feel transparent.
- It lets the repo move from hand-written math to a real framework tensor/autograd model.
- Unlike larger frameworks, the implementation remains compact enough that the backend still serves the educational goal of the repo rather than overwhelming it.
- In this project, `tinygrad` is the bridge between “I can derive this” and “I can run this in a framework without losing the conceptual thread.”
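What “small enough to feel transparent” means can be previewed with an even smaller toy: a scalar reverse-mode autodiff node in a few dozen lines. This is an illustrative sketch of the idea, not tinygrad code:

```python
class Value:
    """A tiny scalar autodiff node, in the spirit of minimal frameworks."""

    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents = parents
        self._backward = lambda: None  # set by the op that created this node

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def back():
            self.grad += other.data * out.grad   # d(xy)/dx = y
            other.grad += self.data * out.grad   # d(xy)/dy = x
        out._backward = back
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def back():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = back
        return out

    def backward(self):
        # Topologically order the graph, then run the reverse-mode sweep.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x = Value(3.0)
y = x * x + x      # f(x) = x^2 + x, so f'(3) = 2*3 + 1 = 7
y.backward()
```

tinygrad is this idea grown just far enough to have real tensors, devices, and kernels, which is why it can serve as the bridge layer without overwhelming the conceptual thread.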
PyTorch remains after tinygrad, not before it.
- PyTorch is the main practical reference backend in the repo.
- It is the place where the implementation should be easiest to extend, train, debug, and compare against common practice.
- It has the richest ergonomics for most papers here, but that is exactly why it should not be the first executable layer. By the time code reaches PyTorch in this pipeline, the repo should already know what it is trying to say.
JAX comes after PyTorch because the project uses it as a second serious executable interpretation, not as the canonical first one.
- JAX forces clearer parameter/state separation.
- JAX makes the function transformation view explicit: `grad`, `jit`, `vmap`, and related structure.
- It is valuable as a parity backend because agreement between PyTorch and JAX catches a class of implementation drift that a single-framework repo would miss.
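The parity idea can be sketched without either framework: two independently written implementations of the same formula should agree within tolerance, and disagreement flags drift in one of them. A hedged NumPy illustration, not the repo's actual harness:

```python
import numpy as np

def softmax_v1(x):
    """Direct formulation: shift, exponentiate, normalize."""
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_v2(x):
    """Independent formulation via log-sum-exp, then exponentiate."""
    lse = x.max() + np.log(np.exp(x - x.max()).sum())
    return np.exp(x - lse)

x = np.array([0.5, -1.0, 2.0])
# A parity check: both paths must agree to tight tolerance.
assert np.allclose(softmax_v1(x), softmax_v2(x), atol=1e-12)
```

Swap `softmax_v1` for a PyTorch forward pass and `softmax_v2` for its JAX counterpart and the structure of the repo's cross-backend checks is the same.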
Cubical Agda comes last because it is not “just another backend.”
- It is the layer for signatures, invariants, structural interfaces, and proofs.
- For some papers this means a meaningful formal core.
- For others it means a deliberately thin but explicit statement of what is worth formalizing and what would be ceremonial.
- The repo keeps it present across all papers because completeness matters, but it does not pretend that every paper deserves the same proof effort.
The important point is that the layers are not interchangeable.
- `SymPy` is not a weaker `tinygrad`.
- `tinygrad` is not a smaller `PyTorch`.
- `JAX` is not just “PyTorch but different syntax.”
- `Cubical Agda` is not an implementation backend in the ordinary sense at all.
They do different jobs:
- `NumPy` gives direct executable mathematics.
- `SymPy` gives algebraic explanation.
- `tinygrad` gives minimal autodiff execution.
- `PyTorch` gives practical implementation depth.
- `JAX` gives functional cross-verification.
- `Cubical Agda` gives formal structure.
This is why the repo uses one canonical pipeline instead of treating the backends as a flat checklist.
Adding tinygrad improves the stack because there was previously a gap between symbolic derivation and industrial frameworks.
Without tinygrad, the jump looked like this:
- symbolic formulas in `SymPy`
- then immediately into `PyTorch` and `JAX`
That works, but it skips an important explanatory layer. tinygrad fills that gap by being:
- executable
- differentiable
- framework-shaped
- still small enough to remain legible
That makes it especially useful for:
- RNNs
- LSTMs
- small attention mechanisms
- compact CNNs
- VAEs and other papers where the core tensor program matters more than ecosystem integrations
It is less informative for some papers with a heavier systems or evaluation emphasis, and the repo records that explicitly in paper notes. But even there, the policy is the same as with SymPy and Cubical Agda: keep the layer present, and be honest when it is thin.
By the end of this pipeline, a paper in the repo can be understood at multiple levels:
- as direct numerical code
- as symbolic mathematics
- as a minimal autodiff program
- as a practical training implementation
- as a functional parity implementation
- as a formal object with explicit invariants
That is the real purpose of the project. It is not only to “have many implementations.” It is to make each paper legible from calculation, to derivation, to execution, to verification.
For implemented papers, the default expectation is:
- NumPy: tiny checks only
- SymPy: always present
- tinygrad: always present
- PyTorch: substantive
- JAX: substantive
- Cubical Agda: always present
And when a layer is low-value for a paper, the repository should say so plainly in NOTES.md rather than faking depth.
This repository is most useful as a research-training and research-clarification project.
It is not primarily trying to be:
- a leaderboard repo
- a production benchmark suite
- a claim that every paper here has been reproduced at full original scale
It is trying to do something narrower and, for many researchers, more durable:
- make important ML papers executable in small form
- make their mathematics explicit rather than merely implied
- make backend agreement part of the method
- make notes about thin or ceremonial layers explicit instead of pretending every layer contributes equally
This repo is especially useful for:
- early-stage AI/ML researchers who want to move from framework fluency to first-principles understanding
- research engineers who want parity checks across multiple backend styles
- theory-minded ML readers who care about the distinction between empirical behavior, symbolic derivation, and formal structure
- teachers and self-learners who want a paper to exist as more than one code artifact
It is less useful for:
- readers who only want the fastest production implementation
- researchers whose only criterion is original-scale benchmark reproduction
- people looking for a single-framework “best practices” repo
The main value is triangulation.
A paper in this repository is not reduced to one implementation language and one style of correctness. Instead, it is seen through several different lenses:
- NumPy: the smallest direct numerical statement
- SymPy: the algebraic or derivational statement
- tinygrad: the smallest real autodiff framework statement
- PyTorch: the practical and extensible statement
- JAX: the functional parity statement
- Cubical Agda: the typed and formal statement
That means the repository can help answer different kinds of questions:
- What is this paper actually computing?
- What equations justify that computation?
- What does autodiff have to recover?
- Does the implementation survive translation across backend paradigms?
- What structural invariant is worth stating explicitly?
For current researchers, that is useful as a debugging and understanding discipline.
For future researchers, it can become a reference corpus for how to study an ML idea across multiple representational layers instead of treating “the PyTorch version” as the whole object.
This project is not the first educational implementation effort, and it is not the first multi-framework effort.
There are clear neighboring precedents:
- Dive into Deep Learning shows that educational material can be written across multiple frameworks.
- The Annotated Transformer is a classic example of deeply explanatory paper-to-code exposition.
- framework-bridging projects such as `EagerPy` and multi-backend scientific ML libraries show that common logic can span several array/tensor systems.
- `tinygrad` itself demonstrates the value of a small, inspectable autodiff framework.
- proof-assistant work around neural-network-adjacent mathematics shows that formal methods can be brought into ML-adjacent domains.
What seems unusual here is the synthesis.
This repository deliberately combines all of the following:
- a fixed paper corpus
- a dependency-aware build order
- a standing multi-layer pipeline
- always-present symbolic and formal layers, even when thin
- backend parity as a normal expectation rather than an optional extra
- paper notes that explicitly say when a layer is low-value or ceremonial
So the claim is not “nothing like this has ever existed.”
The stronger and more defensible claim is:
- the components all have precedents
- the combination is unusual
- the method is the point
The completed repo now has a two-stage narrative.
The first stage is the corpus-building arc. It begins with small learning systems that teach the grammar of ML implementation:
`26`, `02`, `03`, `04`: direct classifiers, vanilla recurrence, gated recurrence, and regularization
It then builds the core architectural backbone:
`07`, `10`, `15`, `11`: convolution, residual routing, identity mappings, and receptive-field control
It then moves into structured sequence and reasoning systems:
`14`, `06`, `08`, `13`, `16`, `18`, `20`, `12`, `21`, `17`: alignment, pointing, set structure, attention, relational reasoning, memory, graphs, sequence alignment, and latent variables
It then reaches modern systems and retrieval behavior:
`09`, `22`, `27`, `28`, `29`, `30`: pipeline structure, scaling-law abstractions, multi-token prediction, retrieval, retrieval-augmented generation, and context-position effects
It closes with a theory-heavy tail:
`05`, `23`, `25`, `24`, `01`, `19`: pruning simplicity, description length, compressibility, capability aggregation, toy complexity dynamics, and automaton structure
The second stage is the verification arc. The repo no longer stops at “all 30 papers are present.” It now distinguishes between:
- structural presence
- thin but real universal verification
- deeper verification where the paper actually justifies it
That second arc matters. The repository now contains a shared observability layer, a repo-wide minimal gradient-parity sweep, theorem-bearing Agda files across all papers, deeper gradient and post-step parity checks for the compact high-signal papers, and explicit proxy-scope metadata where the executable object is narrower than the paper claim. The narrative therefore changes from “build the corpus” to “calibrate the evidence paper by paper.”
**Narrative / Story Box.** The repo now reads less like a pile of ports and more like a staged research apprenticeship. First it teaches how to state the object, derive it, run it, and compare it. Then it teaches a harder lesson: not every paper deserves the same depth of proof, parity, or symbolic treatment, and a serious repo should say exactly what was checked and what remains only proxy-faithful.
The methodological arc is now clearer than the paper list itself.
The repo began with a fixed reduction for every paper:
- what is the smallest executable object?
- what is the core equation or invariant?
- what has to survive translation across backends?
- what is intentionally thin, partial, ceremonial, or only proxy-faithful?
That first method produced structural completeness. But the repo now has a second methodological layer: verification must be tiered rather than uniform.
The current method is:
- write a precise `spec.md`
- choose a tiny deterministic problem or toy object
- make `NumPy` the smallest executable truth source
- add `SymPy`, `tinygrad`, `PyTorch`, and `JAX` as distinct explanatory and executable layers
- keep `Cubical Agda` present as a real formal layer, even when the formal slice is intentionally small
- enforce a universal thin baseline across all papers:
  - parity
  - shape discipline
  - runner coverage
  - minimal gradient parity
  - theorem-bearing Agda presence
- deepen selected papers where the mechanism justifies it:
  - full-parameter Torch/JAX gradient agreement
  - one-step updated-parameter agreement
  - stronger invariant tests
  - stronger Agda lemmas where the abstraction is natural
- say plainly in `NOTES.md` and `verification.yaml` when a paper is only represented through a proxy slice
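The “minimal gradient parity” floor in that method can be sketched as a central-difference check against an analytic gradient: any backend's autodiff result must agree with the numerical estimate on a tiny deterministic input. An illustrative sketch, not the repo's harness:

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of scalar f at x, the thin truth source
    that each backend's analytic gradient must agree with."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        g.flat[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

# Toy objective with a known analytic gradient: f(x) = sum(x^2), grad = 2x.
f = lambda x: float(np.sum(x ** 2))
x = np.array([1.0, -2.0, 0.5])
assert np.allclose(numeric_grad(f, x), 2 * x, atol=1e-4)
```

In the real sweep, `2 * x` would be replaced by the gradient each backend reports, and agreement across NumPy, tinygrad, PyTorch, and JAX constitutes the parity floor.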
That is the main methodological change. The repo is no longer trying to make every paper equally deep. It is trying to make every paper minimally real, and then selectively make the right papers stronger.
This creates three distinct verification tiers:
- universal baseline: every paper must clear a thin but real floor
- selective deepening: compact neural and structure-rich papers get stronger parity and invariant work
- proxy-aware interpretation: theory-heavy or evaluation-heavy papers get stronger scope language instead of fake executable overclaiming
That is a better method than uniformity for its own sake. It respects that some papers are naturally gradient-rich, some are naturally formalizable, and some are better represented honestly through notes and telemetry than through forced code complexity.
The shared refactor reinforces that method. Repo-wide verification patterns now live in shared space, while paper-specific invariants, proofs, and notes remain local. The architecture now mirrors the philosophy:
- repeated method is shared
- paper claims stay paper-local
**Method Box.** The method of the repo is now triangulation with tiers: one compact spec, one tiny deterministic object, one minimal numerical truth source, several executable translations, one thin universal verification floor, and deeper verification only where the paper earns it.
**Narrative / Story Box.** If the early story was “can the repo cover the whole reading list?”, the current story is “can it make its confidence legible?” The answer is increasingly yes, because the repo now separates presence from evidence and separates general completion from paper-specific depth.
**Method Box.** In practice, the repo now works best when each paper answers six questions clearly: what is the smallest executable object, what is the core equation, what must survive backend translation, what is the minimum verification floor, what deeper checks are warranted here, and what part of the paper remains outside the executable regime.
The full 01..30 corpus is now populated.
Current verification status:
- `80` Python tests passing
- `make agda-check` passing across all paper formalization layers
- `scripts/run_paper.py` wired for all `30` papers
That does not mean all papers are implemented at equal depth. It does mean the repository is structurally complete and verified at the level this project claims.
This repository keeps a generated `verification.yaml` file as a repo-level verification record.
It is meant to answer:
- which layers are present for each paper
- which repo-wide checks most recently passed
- whether each paper has minimal Torch/JAX gradient parity coverage
- whether the Agda layer typechecked and contains at least one theorem-bearing definition
- whether the demo runner sweep completed
- which layers are substantive, partial, minimal, or ceremonial
- which commit was actually checked, even when the artifact is committed later
It is intentionally not framed as a certificate. The right interpretation is observability and refreshable status, not authority.
Telemetry commit fields:
- `checked_commit` is the repo commit the checks were run against
- `artifact_commit` is optional and exists to distinguish the commit that stores the YAML artifact from the commit that was checked
- during ordinary local generation, `artifact_commit` is usually left `null`
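A hypothetical excerpt showing roughly how these fields fit together. The exact schema is defined by `scripts/generate_verification_status.py`; every field name below beyond `checked_commit` and `artifact_commit` is an assumption for illustration:

```yaml
# Hypothetical verification.yaml excerpt; not the generator's actual schema.
checked_commit: 0a1b2c3        # commit the checks were run against
artifact_commit: null          # usually null during ordinary local generation
papers:
  "13":
    layers_present: [numpy, sympy, tinygrad, torch, jax, cubical-agda]
    gradient_parity: passed    # minimal Torch/JAX gradient parity
    agda_typechecked: true     # at least one theorem-bearing definition
    runner_demo: passed
```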
Refresh it with:
```bash
python3 scripts/generate_verification_status.py --run-checks
```

or:

```bash
make verification-status
```

Repository layout:

```
sutskever-30-beyond-numpy/
├── README.md
├── papers.yaml
├── docs/
├── shared/              # Shared verification harnesses, note schemas, and future fixture space
├── scripts/
├── src/
│   └── s30bn/
├── templates/
│   └── paper-template.md
└── papers/
    ├── 01_complexity_dynamics/
    ├── 02_char_rnn_karpathy/
    ├── ...
    └── 30_lost_in_middle/
```
Each paper directory contains:
- `README.md`
- `spec.md`
- `NOTES.md`
- `numpy_checks.py`
- `sympy/`
- `tinygrad/`
- `torch/`
- `jax/`
- `cubical-agda/`
- `tests/`
Canonical order remains 01..30.
Recommended build order:
1. `26` CS231n
2. `02` Char RNN
3. `03` LSTM
4. `04` RNN Regularization
5. `07` AlexNet
6. `10` ResNet
7. `15` Identity Mappings in ResNet
8. `11` Dilated Convolutions
9. `14` Bahdanau Attention
10. `06` Pointer Networks
11. `08` Seq2Seq for Sets
12. `13` Attention Is All You Need
13. `16` Relational Reasoning
14. `18` Relational RNN
15. `20` Neural Turing Machine
16. `12` Neural Message Passing
17. `21` Deep Speech 2 / CTC
18. `17` Variational Lossy Autoencoder
19. `09` GPipe
20. `22` Scaling Laws
21. `27` Multi-token Prediction
22. `28` Dense Passage Retrieval
23. `29` Retrieval-Augmented Generation
24. `30` Lost in the Middle
25. `05` Keeping Neural Networks Simple
26. `23` MDL Principle
27. `25` Kolmogorov Complexity
28. `24` Machine Super Intelligence
29. `01` First Law of Complexodynamics
30. `19` Coffee Automaton
The structured source of truth for this is papers.yaml.
```bash
source scripts/env.sh
make test
python3 scripts/run_paper.py --paper 01
python3 scripts/run_paper.py --paper 19
python3 scripts/run_paper.py --paper 30
make agda-check
```

If `agda` is not already on your shell PATH, run:

```bash
source scripts/env.sh
```

To make that persistent in zsh, add this line to `~/.zshrc`:

```bash
export PATH="/Users/hifi/Library/Python/3.9/bin:$PATH"
```

For this repo, use:

```bash
./scripts/install_tinygrad.sh
```

That installs tinygrad without the failing optional macOS Metal dependency chain. The repo's tinygrad code then defaults to LLVM unless you explicitly choose another tinygrad backend.
Educational use under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. See individual papers for original research citations.
If you use these implementations in your work or teaching:
```bibtex
@misc{sutskever30beyondnumpy,
  title={Sutskever 30 Beyond NumPy: Multi-Backend Educational Implementation Suite},
  author={Paul "The Pageman" Pajo and collaborators},
  year={2026},
  note={Educational multi-backend implementations of papers from Ilya Sutskever's recommended reading list, based on the NumPy-version-only repository https://github.com/pageman/Sutskever-30-Implementations}
}
```