Are you an entry-level dev with zero clue how LLMs work? Here's a mental model for how they actually work and where they're heading.
You can learn all of this in 1 year. No PhD. Just curiosity, bookmarks, and late nights.
Start now.
Let's start at the beginning.
Text → Tokens → Embeddings
Text becomes tokens. Tokens become embeddings. Suddenly, you're just a vector of floating-point numbers drifting in high-dimensional space. Vibe accordingly.
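Here's a toy sketch of that pipeline in numpy. The vocabulary and the 8-dim embedding table are made up; real models use learned subword tokenizers (BPE-style) and embedding tables with thousands of dimensions.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}             # toy vocabulary, not a real tokenizer
embedding_table = np.random.randn(len(vocab), 8)   # normally learned during training

text = "the cat sat"
tokens = [vocab[w] for w in text.split()]          # text -> token ids: [0, 1, 2]
embeddings = embedding_table[tokens]               # token ids -> vectors
print(tokens, embeddings.shape)                    # [0, 1, 2] (3, 8)
```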
Positional Embeddings
- Absolute: "I am position 5."
- RoPE: "I am a sine wave."
- ALiBi: "I penalize attention by distance, like a hater."
Models need position signals. Otherwise, they just see a bag of numbers.
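A minimal sketch of the absolute flavor: the sinusoidal encoding from the original Transformer paper. RoPE works differently (it rotates query/key pairs by a position-dependent angle), but this shows what a position signal even looks like.

```python
import numpy as np

def sinusoidal_positions(seq_len, dim):
    pos = np.arange(seq_len)[:, None]           # positions 0..seq_len-1
    i = np.arange(dim // 2)[None, :]            # one frequency per pair of dimensions
    angle = pos / (10000 ** (2 * i / dim))
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angle)                 # even dims get sine
    pe[:, 1::2] = np.cos(angle)                 # odd dims get cosine
    return pe                                   # added to the token embeddings

print(sinusoidal_positions(seq_len=4, dim=8).shape)   # (4, 8)
```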
Attention Is All You Need
- Self-attention: "Who am I allowed to pay attention to?"
- Multi-head: "What if I do that 8 times in parallel?"
- QKV: query, key, value. Looks like a crypto scam, but it's the core of intelligence.
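A minimal single-head, numpy-only sketch of scaled dot-product self-attention. No causal mask, no multiple heads; just the QKV mechanics.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                    # project every token to query/key/value
    scores = Q @ K.T / np.sqrt(K.shape[-1])             # how relevant is each token to each other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                  # weighted mix of values

d = 8
x = np.random.randn(5, d)                               # 5 tokens, 8-dim embeddings
out = self_attention(x, *(np.random.randn(d, d) for _ in range(3)))
print(out.shape)                                        # (5, 8); multi-head = run this 8x and concat
```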

Transformers in Action
Take inputs. Smash them through attention layers. Normalize, activate, repeat. Dump the logits. Congrats: you just inferred a token.
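Here's that loop as one minimal pre-norm block plus the final unembedding, in numpy. Real models add a causal mask, multiple heads, GELU/SiLU instead of ReLU, and dozens of stacked blocks; the shape of the computation is the point.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def block(x, Wq, Wk, Wv, W1, W2):
    h = layer_norm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    x = x + softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V    # attention + residual
    x = x + np.maximum(0, layer_norm(x) @ W1) @ W2         # MLP + residual
    return x

d, vocab_size = 8, 100
x = np.random.randn(5, d)                                  # 5 token embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
W1, W2 = np.random.randn(d, 4 * d), np.random.randn(4 * d, d)
logits = block(x, Wq, Wk, Wv, W1, W2) @ np.random.randn(d, vocab_size)
print(logits.shape)                                        # (5, 100): a score for every vocab token
```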
Sampling Tricks
How chaotic do you want your model?
- Temperature: chaos dial.
- Top-k: pick from the top K.
- Top-p: pick from the smallest group whose probs sum to p.
- Beam search? Don't ask.
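A minimal sketch combining the three dials on one logits vector. The exact cutoff rule for top-p varies slightly between implementations; this is one reasonable version.

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=50, top_p=0.9, rng=np.random.default_rng()):
    probs = np.exp(logits / temperature - (logits / temperature).max())
    probs /= probs.sum()                                 # temperature-adjusted distribution
    order = np.argsort(probs)[::-1][:top_k]              # top-k: keep only the K most likely
    keep = np.cumsum(probs[order]) <= top_p              # top-p: smallest set summing to ~p
    keep[0] = True                                       # never drop the single best token
    order = order[keep]
    return rng.choice(order, p=probs[order] / probs[order].sum())

print(sample(np.random.randn(1000), temperature=0.8))
```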
KV Cache = Cheat Code
Save past keys and values. Skip reprocessing old tokens. Turns a 90B model from meltdown mode into real-time genius.
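The idea in a toy numpy loop: each generation step computes K/V for the newest token only, appends them to the cache, and attends over everything cached so far.

```python
import numpy as np

d = 8
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []                       # keys/values of every token seen so far

def generate_step(x_new):
    q = x_new @ Wq                              # only the new token gets projected
    k_cache.append(x_new @ Wk)
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    w = np.exp(K @ q / np.sqrt(d))
    w /= w.sum()                                # softmax over cached positions
    return w @ V                                # attention output for the new token

for _ in range(5):                              # old tokens are never re-encoded
    out = generate_step(np.random.randn(d))
print(out.shape, len(k_cache))                  # (8,) 5
```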
Long Context Hacks
- Sliding window: attend only to the last N tokens, like a scanner (mask sketch after this list).
- Infini-attention: squash older context into a fixed-size compressive memory.
- Memory layers: diary-style recall.
Models don't really "remember." They just hack around limits.
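The sliding-window trick is literally just a mask. A minimal sketch: each token may attend to itself and the previous window - 1 tokens, so attention cost stops growing quadratically with context length.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]     # query positions
    j = np.arange(seq_len)[None, :]     # key positions
    return (j <= i) & (j > i - window)  # causal AND within the last `window` tokens

print(sliding_window_mask(seq_len=6, window=3).astype(int))
```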
Scaling Tricks
- MoE: Only the experts reply. Route tokens to sub-networks and light up ~3B params instead of 80B (routing sketch after this list).
- GQA: Fewer keys/values, faster inference. "Be fast without being dumb."
- Normalization & Activations: LayerNorm, RMSNorm, GELU, ReLU, SiLU. Failed Pokémon names that keep networks stable.
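A toy top-2 MoE routing sketch. Real implementations batch tokens per expert, renormalize the top-k gate weights, and add load-balancing losses; this just shows the "only a few experts fire per token" idea.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d, n_experts = 8, 4
router = np.random.randn(d, n_experts)                         # tiny learned routing matrix
experts = [(np.random.randn(d, 4 * d), np.random.randn(4 * d, d)) for _ in range(n_experts)]

def moe_layer(x, top_k=2):
    out = np.zeros_like(x)
    gates = softmax(x @ router)                                # (tokens, experts) routing scores
    for t, g in enumerate(gates):
        for e in np.argsort(g)[-top_k:]:                       # only the top-k experts run
            W1, W2 = experts[e]
            out[t] += g[e] * (np.maximum(0, x[t] @ W1) @ W2)   # gated expert MLP output
    return out

print(moe_layer(np.random.randn(5, d)).shape)                  # (5, 8)
```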
Training Goals
Causal LM: guess the next word.
Masked LM: guess the missing one.
Fill-in-the-middle, span prediction, instruction tuning. LLMs are trained on the art of guessing, and they got good at it.
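Causal LM in one function: shift the targets by one and take cross-entropy on the next token. Masked LM is the same idea with masked positions as the targets instead.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """logits: (seq_len, vocab), token_ids: (seq_len,) int ids."""
    logits, targets = logits[:-1], token_ids[1:]                       # position t predicts token t+1
    logits = logits - logits.max(axis=-1, keepdims=True)               # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()              # average negative log-likelihood

print(next_token_loss(np.random.randn(6, 100), np.random.randint(0, 100, size=6)))
```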
Tuning
- Finetuning: keep training the weights on your own data (or bolt on small adapters like LoRA).
- Instruction tuning: "act helpful."
- RLHF: vibes via human clicks.
- DPO: direct preference optimization. Skip the reward model and optimize directly on what humans upvote (loss sketch below).
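The DPO loss for a single (prompt, chosen, rejected) pair, assuming you already have the summed log-probs of each answer under the policy being tuned and under a frozen reference model. beta controls how far the policy may drift from the reference.

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # reward margin: how much more the policy prefers the chosen answer than the reference does
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))   # -log sigmoid(beta * margin)

# toy log-probs: training pushes the policy to widen this margin
print(dpo_loss(policy_chosen=-12.0, policy_rejected=-15.0, ref_chosen=-13.0, ref_rejected=-14.0))
```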
Scaling Laws
More data. More parameters. More compute. Loss goes down predictably. Intelligence is now a budget line item.
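The "predictably" part has a shape. A Chinchilla-style fit models loss as a power law in parameter count N and training tokens D; the constants below are ballpark illustrations, not an exact published fit.

```python
def predicted_loss(N, D, E=1.69, A=406.0, B=411.0, alpha=0.34, beta=0.28):
    # irreducible loss + a term that shrinks with model size + a term that shrinks with data
    return E + A / N**alpha + B / D**beta

print(predicted_loss(N=7e9, D=1.4e12))     # scale either axis up and the number goes down
```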
Final Boss
- Quantization: Shrink models, run faster. GGUF, AWQ, GPTQ: zip files with extra spice (int8 sketch after this list).
- Training vs Inference Stacks: DeepSpeed, Megatron, FSChat (pain). vLLM, TGI, TensorRT-LLM (speed). Everyone has a repo, nobody reads the docs.
- Synthetic Data: Models teaching themselves. Feedback loops of hallucination. Ouroboros era unlocked.
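Quantization, minus the spice: symmetric int8 in a few lines of numpy. Store integers plus one float scale, dequantize on the fly. GGUF/AWQ/GPTQ add groups, zero points, and activation-aware calibration on top of this basic trade of precision for memory.

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)
scale = np.abs(weights).max() / 127.0              # map the largest weight onto the int8 range
q = np.round(weights / scale).astype(np.int8)      # 4 bytes per weight -> 1 byte
dequantized = q.astype(np.float32) * scale         # reconstruct at inference time
print(np.abs(weights - dequantized).max())         # small error, 4x less memory
```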
OK, what is next?
So now you've got the map: tokens, embeddings, attention, hacks, tuning, scaling. You can see how the pieces click. But knowing how it works only answers half the question. The real one is: where is this all going? Every few months, someone declares something "dead." Prompt engineering is dead. RAG is dead. Let's unpack that.
Once treated like wizardry, prompt engineering is now mostly baseline. Everyone knows how to use system prompts, few-shot examples, and chain-of-thought. Still useful, but it's no longer a moat now that models handle even sloppy prompts reasonably well.
Retrieval-Augmented Generation gets dismissed because "LLMs have giant context windows now." Wrong. Context ≠ memory. Models lose track of details deep into ~100k-token contexts. Facts change daily. Fine-tuning can't keep up. RAG stays essential.
Most pipelines today are bad: just a vector DB + cosine similarity. Real RAG means reranking, guardrails, telemetry, and evaluation-driven loops.
Fine-tuning makes models polite, on-brand, or specialized. But if you want up-to-date truth? That's retrieval (pipeline sketch below). Nobody wants to retrain a 70B model because a PowerPoint changed.
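For shape, here's a minimal retrieve-rerank-generate loop. embed, vector_db.search, rerank, and llm.generate are hypothetical stand-ins for whatever embedding model, vector store, cross-encoder, and LLM client you actually use; the point is that cosine similarity is only the first, cheap stage.

```python
def answer(question, vector_db, embed, rerank, llm, k=20, top_n=5):
    candidates = vector_db.search(embed(question), k=k)   # cheap recall: vector similarity
    passages = rerank(question, candidates)[:top_n]       # expensive precision: cross-encoder rerank
    context = "\n\n".join(p.text for p in passages)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt), passages                 # keep sources for guardrails and evals
```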
We're moving from hacks to infrastructure. From "just prompt it" to systems with eval loops, telemetry, and embeddings tuned for real domains. The winners aren't prompt wizards; they're system builders.