LLMs Explained: Context Stuffing vs. Updating Model Weights

This workshop explores how Large Language Models (LLMs) manage knowledge, distinguishing between ephemeral context and persistent weights. It critiques current reliance on large context windows and RAG for "long-tail" knowledge due to cost and latency. The core argument advocates for efficiently injecting specific knowledge directly into model weights to improve performance for niche tasks.

A practical look at LLM memory: stuffing context vs updating weights

If you’ve ever shoved a 10,000‑word doc into a prompt and watched the model choke, you’re not alone. Jack Morris’s workshop digs into that familiar pain, and into why the typical fixes, bigger context windows or Retrieval-Augmented Generation (RAG), aren’t always the best route.

Here’s the gist, plain and simple. LLMs have two kinds of memory: the short‑lived context you feed them, and the persistent knowledge baked into their weights. Morris lays out three ways to inject knowledge:
1. Full Context: put everything directly into the prompt. Fine for tiny domains, but expensive and slow.
2. RAG: fetch chunks of info as needed. Useful, but it adds latency and complexity.
3. Training into Weights: the bold idea of actually updating the model parameters with niche knowledge, so the model “remembers” without carrying huge context every time.
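The three strategies differ mainly in what ends up in the prompt. A toy sketch (my own illustration, not Morris’s code) makes the contrast concrete; here a keyword-overlap scorer stands in for a real retriever, and prompts are returned as strings rather than sent to a model:

```python
def answer_with_full_context(query, documents):
    # Strategy 1: stuff every document into the prompt.
    return "\n\n".join(documents) + "\n\nQ: " + query

def retrieve(query, documents, k=1):
    # Strategy 2 (RAG): keep only the chunks most relevant to the query.
    # Real systems use embeddings; word overlap is a stand-in.
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def answer_with_rag(query, documents):
    return "\n\n".join(retrieve(query, documents)) + "\n\nQ: " + query

def answer_from_weights(query):
    # Strategy 3: the knowledge lives in the parameters,
    # so the prompt is just the question.
    return "Q: " + query
```

The prompt shrinks as you move down the list, which is exactly why the third option is attractive when latency matters.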

Two striking points stood out. First, the “Context Trap”: ballooning the context raises compute and latency dramatically. Morris gives numbers: roughly 10,000 tokens/second of output with 1,000 tokens of context, but that throughput can collapse to about 130 tokens/second with a 128k context window. Ouch. Second, “Context Rot”: the model’s reasoning can degrade as the context grows. So bigger isn’t always smarter.
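Morris’s numbers translate directly into wall-clock time. A quick back-of-envelope calculation (the 500-token answer length is my assumption, chosen only to make the comparison concrete):

```python
# Same 500-token answer at the two decode speeds Morris reports.
tokens_out = 500
fast = tokens_out / 10_000  # ~1k-token context: seconds to finish
slow = tokens_out / 130     # 128k context: seconds to finish
slowdown = 10_000 / 130     # relative throughput hit, ~77x
```

A twentieth of a second versus nearly four seconds for the same answer; that is the trap in one line of arithmetic.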

In the Q&A, Morris even suggests Federated Learning might make a comeback, because you often only need to tweak a small slice of parameters (a million, not a trillion), which makes syncing updates practical for specialized use cases.
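Morris doesn’t name a specific technique in the talk, but LoRA-style low-rank adapters are one standard way to see why “a million, not a trillion” is plausible: a rank-r update to a weight matrix of shape (d_out, d_in) trains only r * (d_in + d_out) parameters instead of d_in * d_out. A sketch, with dimensions typical of a 7B-class model assumed for illustration:

```python
def lora_params(d_in, d_out, rank):
    # A low-rank update W + B @ A trains matrices A (rank x d_in)
    # and B (d_out x rank): rank * (d_in + d_out) parameters total.
    return rank * (d_in + d_out)

full = 4096 * 4096                 # one full projection matrix: ~16.8M params
lora = lora_params(4096, 4096, 8)  # rank-8 update: 65,536 params
ratio = full // lora               # 256x fewer trainable parameters
```

Syncing tens of thousands of adapter parameters per matrix, rather than the full weights, is what would make federated-style updates practical.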

I’ve tried both sides — prompt stuffing for a quick fix, and small fine‑tunes for production. The latter feels cleaner. If you care about latency and accuracy for niche tasks, injecting knowledge into weights is worth exploring.

Watch the full talk here: https://youtu.be/Jty4s9-Jb78?si=c9cv4Z5ySGvz71J-

Bottom line, we’ll probably move to a hybrid future: lean models that reason well, plus targeted weight updates for what really matters. That feels hopeful, and practical.
