LLM: Large Layers of Madness (2025)

 
Are you an entry-level dev with zero clue how LLMs work? Here’s a mental model for how they actually work and where they’re heading.
You can learn all of this in 1 year. No PhD. Just curiosity, bookmarks, and late nights.
Start now.

Let’s start at the beginning

Text → Tokens → Embeddings

Text becomes tokens. Tokens become embeddings. Suddenly, you’re just a floating-point vector drifting in a few-thousand-dimensional space. Vibe accordingly.
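Here’s a toy sketch of that pipeline (made-up four-word vocab, random embedding table; real models use BPE tokenizers and learned embeddings, but the shape of the idea is the same):

```python
import numpy as np

# Toy vocab + random embedding table. Real models learn these; this is just the mechanics.
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
d_model = 8                                    # embedding width (tiny on purpose)
embedding_table = np.random.randn(len(vocab), d_model)

text = "the cat sat down"
tokens = [vocab[w] for w in text.split()]      # text -> token ids
embeddings = embedding_table[tokens]           # token ids -> vectors

print(tokens)             # [0, 1, 2, 3]
print(embeddings.shape)   # (4, 8): four tokens, each a point in 8-D space
```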

Positional Embeddings

  • Absolute: “I am position 5.”
  • RoPE: “I am a sine wave.”
  • ALiBi: “I scale attention by distance like a hater.”
Models need position signals. Otherwise, they just see a bag of numbers.
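A minimal sketch of the absolute (sinusoidal) flavor; RoPE and ALiBi instead bake position into the attention math itself:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Absolute positional encoding: each position gets its own sine/cosine pattern."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Add the position signal to the token embeddings; without it, attention
# sees an orderless bag of vectors.
# x = embeddings + sinusoidal_positions(len(tokens), d_model)
```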

Attention Is All You Need

  • Self-attention: “Who am I allowed to pay attention to?”
  • Multi-head: “What if I do that 8 times in parallel?”
  • QKV: query, key, value. Looks like a crypto scam, but it’s the core of intelligence.
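A single-head sketch of the QKV dance (multi-head just runs several of these in parallel and concatenates the results):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv, causal=True):
    """Scaled dot-product self-attention over a (seq_len, d) input."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv               # query, key, value projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # who pays attention to whom
    if causal:
        mask = np.triu(np.ones_like(scores), k=1)  # no peeking at future tokens
        scores = np.where(mask == 1, -1e9, scores)
    return softmax(scores) @ V                     # weighted mix of the values
```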

Transformers in Action

Take inputs. Smash them through attention layers. Normalize, activate, repeat. Dump the logits. Congrats—you just inferred a token.
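Roughly, in code (reusing `self_attention` from the sketch above; real blocks also wrap each step in layernorm or RMSnorm):

```python
import numpy as np

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    """One block: attention + residual, then a token-wise MLP + residual."""
    x = x + self_attention(x, Wq, Wk, Wv)      # mix information across tokens
    x = x + np.maximum(0, x @ W1) @ W2         # ReLU MLP, applied to each token
    return x

# Stack N blocks, then project the last token's vector onto the vocab:
# logits = x[-1] @ embedding_table.T           # one score per vocab token
```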

Sampling Tricks

How chaotic do you want your model?
  • Temperature: chaos dial.
  • Top-k: pick from the top K.
  • Top-p: pick from the smallest group whose probs sum to p.
  • Beam search? Don’t ask.
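All three knobs in one hedged sketch (assuming the input is raw logits over the vocab):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Turn raw logits into one sampled token id."""
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)  # chaos dial
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                  # most likely first
    if top_k is not None:
        order = order[:top_k]                        # keep only the top K
    if top_p is not None:
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        order = order[:cutoff]                       # smallest set whose probs sum to p

    kept = probs[order] / probs[order].sum()         # renormalize the survivors
    return int(np.random.choice(order, p=kept))
```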

KV Cache = Cheat Code

Save past keys and values. Skip reprocessing old tokens. Turns a 90B model from meltdown mode into real-time genius.
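A back-of-the-napkin sketch of why it’s a cheat code: each new token only pays for itself.

```python
import numpy as np

class KVCache:
    """Append-only store of past keys/values so old tokens never get re-encoded."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)                          # remember this token's key/value
        self.values.append(v)
        K, V = np.stack(self.keys), np.stack(self.values)
        scores = K @ q / np.sqrt(len(q))             # new query vs. every cached key
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                                 # attention output for the new token

# Decoding loop: compute q, k, v for the NEW token only, then cache.step(q, k, v).
# Without the cache you'd recompute K and V for the entire prefix on every step.
```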

Long Context Hacks

  • Sliding window: attention like a scanner.
  • Infini-attention: squash old context into a compressive memory and keep going.
  • Memory layers: diary-style recall.
Models don’t really “remember.” They just hack around limits.
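The sliding-window trick is literally just a mask (a toy sketch, here with a window of 3):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where each token only attends to the previous `window` tokens."""
    i = np.arange(seq_len)[:, None]    # query positions
    j = np.arange(seq_len)[None, :]    # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))
# Each row is one token's allowed span: 1 = can look, 0 = masked out.
# Cost grows with the window, not with the full context length.
```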

Scaling Tricks

  • MoE: Only the experts reply. Route tokens to sub-networks, light up ~3B params instead of 80B.
  • GQA: Fewer keys/values, faster inference. “Be fast without being dumb.”
  • Normalization & Activations: Layernorm, RMSnorm, GELU, ReLU, SiLU. Failed Pokémon names that keep networks stable.
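A hedged sketch of MoE top-k routing (real layers batch this, run experts in parallel, and add load-balancing losses):

```python
import numpy as np

def moe_layer(x, gate_W, experts, top_k=2):
    """Route each token to its top-k experts; only those sub-networks do any work."""
    out = np.zeros_like(x)
    for t, token in enumerate(x):                    # x: (seq_len, d)
        scores = token @ gate_W                      # one gating score per expert
        chosen = np.argsort(scores)[::-1][:top_k]    # the experts that get to reply
        w = np.exp(scores[chosen] - scores[chosen].max())
        w /= w.sum()
        for weight, e in zip(w, chosen):
            out[t] += weight * experts[e](token)     # everyone else stays asleep
    return out

# experts = a list of small MLPs; with 64 experts and top_k=2, most of the
# parameters sit idle for any given token.
```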

Training Goals

Causal LM: guess the next word.
Masked LM: guess the missing one.
Fill-in-the-middle, span prediction, instruction tuning. LLMs trained on the art of guessing—and got good at it.
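Causal LM in one function (a sketch: shift the targets by one and take cross-entropy):

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Next-token prediction: position t's logits must guess token t+1."""
    preds = logits[:-1]                                # last position has nothing left to predict
    targets = np.asarray(token_ids[1:])                # shift targets left by one
    preds = preds - preds.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()   # average cross-entropy
```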

Tuning

  • Finetuning: nudge the existing weights (or bolt on small adapters like LoRA).
  • Instruction tuning: “act helpful.”
  • RLHF: vibes via human clicks.
  • DPO: direct preference optimization—do what humans upvote.
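DPO fits in a few lines. A hedged sketch, where the inputs are summed log-probs of whole responses under the policy and a frozen reference model:

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Push the policy's log-prob margin for the human-preferred answer
    above the reference model's margin. No reward model, no RL loop."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))   # -log sigmoid(beta * margin)
```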

Scaling Laws

More data. More parameters. More compute. Loss goes down predictably. Intelligence is now a budget line item.
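The Chinchilla-style fit makes the “budget line item” part literal. A sketch using the published constants, which you should treat as illustrative rather than gospel:

```python
def scaling_law_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted loss for N parameters trained on D tokens (Chinchilla parametric form)."""
    return E + A / N**alpha + B / D**beta

print(scaling_law_loss(N=70e9, D=1.4e12))   # more params + more tokens -> lower loss, predictably
```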

Final Boss

  • Quantization: Shrink models, run faster. GGUF, AWQ, GPTQ—zip files with extra spice. (Tiny sketch after this list.)
  • Training vs Inference Stacks: DeepSpeed, Megatron, FastChat (pain). vLLM, TGI, TensorRT-LLM (speed). Everyone has a repo, nobody reads the docs.
  • Synthetic Data: Models teaching themselves. Feedback loops of hallucination. Ouroboros era unlocked.
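The quantization trick, in its simplest symmetric-int8 form (GGUF, AWQ, and GPTQ do much smarter grouping and calibration than this):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric int8 quantization: store tiny integers plus one scale per tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale               # approximately the original weights

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())             # small error, ~4x less memory than fp32
```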

OK, what is next?

So now you’ve got the map: tokens, embeddings, attention, hacks, tuning, scaling. You can see how the pieces click. But knowing how it works only answers half the question. The real one is: where is this all going? Every few months, someone declares something “dead.” Prompt engineering is dead. RAG is dead. Let’s unpack that.
Once treated like wizardry, prompt engineering is now mostly baseline. Everyone knows how to use system prompts, few-shot examples, and chain-of-thought. Still useful, but no longer the moat it once was, since LLMs cope better and better with bad prompts.
Retrieval-Augmented Generation gets dismissed because “LLMs have giant context windows now.” Wrong. Context ≠ memory. Models get lossy deep into long contexts, facts change daily, and fine-tuning can’t keep up. RAG stays essential.
Most pipelines today are bad: just a vector DB plus cosine similarity. Real RAG means reranking, guardrails, telemetry, and evaluation-driven loops.
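A minimal sketch of the retrieve-then-rerank step, where `rerank_score` is a hypothetical stand-in for a cross-encoder or LLM judge:

```python
import numpy as np

def retrieve_and_rerank(query_vec, doc_vecs, docs, rerank_score, k=20, final_k=5):
    """First pass: cheap cosine similarity over the vector store.
    Second pass: rerank the shortlist with a stronger scorer."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    shortlist = np.argsort(sims)[::-1][:k]                      # naive pipelines stop here
    reranked = sorted(shortlist, key=lambda i: rerank_score(docs[i]), reverse=True)
    return [docs[i] for i in reranked[:final_k]]                # what actually goes in the prompt
```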
Fine-tuning makes models polite, on-brand, or specialized. But if you want up-to-date truth? That’s retrieval. Nobody wants to retrain a 70B model because a PowerPoint changed.

We’re moving from hacks to infrastructure. From “just prompt it” to systems with eval loops, telemetry, and embeddings tuned for real domains. The winners aren’t prompt wizards—they’re system builders.