Karpathy's MicroGPT: A Full LLM in 200 Lines of Python

MAR 01 · AI · 4 MIN READ

Andrej Karpathy published a 200-line Python file that trains and runs a GPT. No PyTorch, no NumPy, no dependencies of any kind. Everything -- autograd, the neural network architecture, tokenization, the training loop, inference -- sits in a single file that fits on three columns of a page.

The file, available as a GitHub gist and on Google Colab, is the culmination of a series Karpathy has been building for years: micrograd (a minimal autograd engine), makemore (character-level language models), nanoGPT (a clean GPT-2 implementation). MicroGPT collapses the entire stack into one place. It trains on a dataset of 32,000 names and learns to generate plausible new ones. The training converges. The inference works. The whole thing runs in pure Python.

How MicroGPT Works Under the Hood

The core mechanism is a Value class that represents a scalar and tracks its gradient. Every arithmetic operation -- addition, multiplication, tanh activation -- records itself in a computation graph. When you call backward(), the engine walks that graph in reverse topological order and applies the chain rule at each node to compute gradients. This is autograd from scratch, identical in principle to what PyTorch does, minus the C++ kernels and GPU dispatch underneath. The GPT-2-like architecture sits on top of that: attention heads, layer norms, embeddings -- all built from these scalar operations. Training uses a hand-written Adam optimizer. The whole thing is slow by any practical measure, but it runs.
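The mechanism is compact enough to sketch. The following is a minimal illustration in the spirit of micrograd -- the method names and internal details are this article's own, not a copy of the gist's actual code -- showing how each operation records a local backward rule and how backward() replays them in reverse topological order:

```python
import math

class Value:
    """A scalar that remembers how it was computed, micrograd-style."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # how to push this node's gradient to its parents
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad        # d(a+b)/da = 1
            other.grad += out.grad       # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad  # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# y = a*b + a, so dy/da = b + 1 and dy/db = a
a, b = Value(3.0), Value(2.0)
y = a * b + a
y.backward()
print(a.grad, b.grad)  # 3.0 3.0
```

Note that gradients accumulate with `+=`: a value used in two places (like `a` above) correctly sums the contributions from both paths through the graph.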

Everything Else Is Just Efficiency

The key insight Karpathy is demonstrating is captured in one sentence: "Everything else is just efficiency." PyTorch, CUDA, flash attention, quantization, distributed training -- those are real and important, but they are optimizations on top of a core algorithm you can write in an afternoon. The algorithmic content of an LLM is surprisingly small. Production frameworks exist to run that algorithm fast at scale, not to make the algorithm possible.

This is also the fourth iteration of the same argument at increasing compression. Micrograd proved you could build autograd in roughly 150 lines. Makemore proved you could build a character-level language model from scratch. NanoGPT proved you could train a serious GPT-2 in clean PyTorch. MicroGPT collapses all of that, removing PyTorch from the picture entirely. Each step in the series makes the there-is-no-magic-here argument harder to dismiss.

Why You Should Read 200 Lines of Code

For you as a developer, the practical value is not that you should use MicroGPT for production work. It is that reading 200 lines gives you a ground-truth understanding of what your frameworks are actually doing. If you have been treating transformers as a black box and calling model.fit(), MicroGPT is the x-ray. When your training loss stops decreasing, or gradients explode, or attention patterns degenerate, you are much better equipped to diagnose it if you have once held the whole system in your head at once. The gist takes about an hour to read carefully. That hour has a high expected return.

The timing of MicroGPT is not accidental. We are roughly three years past the moment when transformers stopped being a research curiosity and became the dominant architecture for a wide range of tasks. In those three years, the frameworks, tooling, and deployment infrastructure around transformers have become sufficiently mature that most practitioners never need to understand what is happening at the scalar level. You call the API, you fine-tune the checkpoint, you deploy the endpoint. The abstraction holds. MicroGPT is Karpathy betting that the abstraction holding is not the same as the abstraction being irrelevant -- that practitioners who understand the mechanism will make better decisions about when to fine-tune versus prompt, when architecture matters versus when it doesn't, when a problem is actually a transformer problem versus when you're using transformers because transformers are what everyone is currently using.

KEY POINTS:

- 200-line pure Python file, zero dependencies, full LLM in one gist
- Implements autograd from scratch via a scalar Value class
- Backprop via topological sort and chain rule, same as PyTorch under the hood
- Trains on 32,000 names dataset; generates plausible new names at inference
- GPT-2-like architecture: attention heads, layer norms, hand-written Adam optimizer
- Core insight: algorithms are small; everything else is efficiency and scale
- Culmination of micrograd, makemore, nanoGPT series
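The hand-written Adam optimizer from the list above can also be sketched in a few lines of plain Python. The function below is illustrative -- the signature and the toy quadratic objective are this article's own, not the gist's code -- but it implements the standard Adam update with bias-corrected moment estimates:

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over plain-Python lists of parameters and gradients."""
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1 - beta1) * g       # EMA of the gradient (first moment)
        v[i] = beta2 * v[i] + (1 - beta2) * g * g   # EMA of the squared gradient (second moment)
        m_hat = m[i] / (1 - beta1 ** t)             # bias correction: early EMAs are biased toward 0
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
x, m, v = [5.0], [0.0], [0.0]
for t in range(1, 2001):
    adam_step(x, [2 * x[0]], m, v, t, lr=0.1)
print(x[0])  # close to the minimum at 0
```

The per-parameter second-moment estimate is what gives Adam its adaptive step sizes; everything else is a momentum-smoothed gradient descent step.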