TLDR: Tokens, context windows, embeddings, RLHF, RAG, temperature — every term you keep Googling explained once, clearly, for engineers who are building with this stuff. The quick reference table at the end is the part I actually go back to.
Here's what happened when I started working with LLM APIs: I spent the first week Googling "what is a token" every time I saw it in pricing docs. Then I spent another week not quite understanding why my chatbot was getting expensive on long conversations. Then I learned what "context window" actually meant, and it all clicked at once.
This glossary is the reference I wish I'd had. No PhD required — just the vocabulary you actually need, explained for engineers who are building things.
The Fundamentals
Token
The atomic unit of text a language model processes. Not words, not characters — tokens. A token is roughly 3–4 characters of English text, but this varies by language and tokenizer. The word developer is one token; unambiguous might be two.
Why this matters to you: Every API call costs tokens — both input and output. When you see "128k context window," that's roughly 96,000 words. When you see "$3.00 per million input tokens," you're paying for tokens, not words.
"Hello, world!" → ["Hello", ",", " world", "!"] // 4 tokens
Rule of thumb: divide your character count by ~4 to estimate tokens for English text.
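If you want that rule of thumb as code, here's a minimal sketch. The function names are mine, and the 4-characters-per-token ratio is only the rough English-text estimate from above; a real tokenizer will give different counts.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text (assumption: ~4 chars per token)."""
    return math.ceil(len(text) / chars_per_token)

def estimate_input_cost(text: str, usd_per_million_tokens: float) -> float:
    """Approximate input cost in USD at a given per-million-token price."""
    return estimate_tokens(text) * usd_per_million_tokens / 1_000_000

prompt = "Hello, world!"
print(estimate_tokens(prompt))            # 13 characters / 4, rounded up
print(estimate_input_cost(prompt, 3.00))  # at $3.00 per million input tokens
```

Treat the result as a budgeting estimate only. For billing-accurate numbers, use the token counts your provider reports.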
Context Window
The maximum amount of text (in tokens) the model can see at one time — including system prompt, conversation history, any documents you've injected, and the model's own output.
The analogy I use: think of it as RAM, not disk. Data outside the context window doesn't exist to the model. It's not stored somewhere — it's just gone.
| Model | Context Window |
|---|---|
| GPT-4o | 128k tokens |
| Claude Sonnet 4.6 | 200k tokens |
| Gemini 1.5 Pro | 1M tokens |
This is also why long conversations get expensive — every message you send includes the entire conversation history so far. Turn 20 of a chat sends 19 prior exchanges plus the current message, every single time.
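To see how quickly resending history adds up, here's a small back-of-the-envelope sketch. It assumes every user and assistant message is the same size (100 tokens here, a made-up figure) and that nothing is cached or truncated.

```python
def tokens_sent_per_turn(turns: int, tokens_per_message: int,
                         system_tokens: int = 0) -> list[int]:
    """Input tokens sent on each turn when the full history is resent."""
    sent = []
    for turn in range(1, turns + 1):
        prior_messages = 2 * (turn - 1)  # earlier user + assistant messages
        sent.append(system_tokens + (prior_messages + 1) * tokens_per_message)
    return sent

per_turn = tokens_sent_per_turn(turns=20, tokens_per_message=100)
print(per_turn[0])    # first turn: just the new message
print(per_turn[-1])   # turn 20: 38 prior messages plus the new one
print(sum(per_turn))  # total input tokens billed across the conversation
```

Under these assumptions, turn 20 alone sends 3,900 input tokens and the whole conversation bills 40,000, even though the user only typed 2,000 tokens' worth of messages. The growth is quadratic in the number of turns.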
Embedding
A numerical vector (array of floats) that represents the semantic meaning of text. Similar text produces vectors that are geometrically close in high-dimensional space. Semantically different text produces vectors that are far apart.
# The classic example:
embed("king") - embed("man") + embed("woman") ≈ embed("queen")
Why this matters: Embeddings are what power semantic search, recommendation systems, and RAG. When you search a vector database, you're finding vectors (and the text they represent) that are geometrically close to your query's embedding.
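"Geometrically close" usually means cosine similarity. Here's a sketch using hand-made 3-dimensional vectors as stand-ins; real embeddings come from an embedding model and have hundreds or thousands of dimensions, so the numbers below are purely illustrative.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (made up for illustration, not output from a real model)
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

def nearest(query: str) -> str:
    """Return the other word whose vector is closest to the query's."""
    others = [w for w in vectors if w != query]
    return max(others, key=lambda w: cosine_similarity(vectors[query], vectors[w]))

print(nearest("cat"))
```

A vector database does the same comparison, just with approximate indexes so it stays fast over millions of vectors.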
Inference
Running a trained model to generate output. This is what you pay for when using an API. Contrast with training, which is when the model learns — that happens before you ever touch it.
Cost breakdown for inference:
- Input tokens: Everything you send in
- Output tokens: What the model generates (usually 3–5× more expensive per token)
- Cache read tokens: Much cheaper — only on providers that support prompt caching
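The three line items above combine into a simple per-request formula. This is a sketch with hypothetical prices: the $3.00 input price echoes the earlier example, while the output and cache-read prices are placeholders I picked, so substitute your provider's actual rates.

```python
from dataclasses import dataclass

@dataclass
class Pricing:
    """USD per million tokens. All values used below are illustrative."""
    input_per_mtok: float
    output_per_mtok: float
    cache_read_per_mtok: float

def request_cost(p: Pricing, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Cost of one request; cached_tokens are billed at the cache-read rate."""
    total = (input_tokens * p.input_per_mtok
             + output_tokens * p.output_per_mtok
             + cached_tokens * p.cache_read_per_mtok)
    return total / 1_000_000

pricing = Pricing(input_per_mtok=3.00, output_per_mtok=15.00, cache_read_per_mtok=0.30)
print(request_cost(pricing, input_tokens=1_000, output_tokens=500))
```

Notice that 500 output tokens cost more here than 1,000 input tokens, which is why capping output length is often the first cost lever to pull.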
Model Behavior Controls
Temperature
A sampling parameter controlling output randomness. The allowed range depends on the provider (commonly 0.0–2.0, sometimes 0.0–1.0).
0.0 → always picks the highest-probability next token (near-deterministic in practice)
1.0 → samples proportionally from the probability distribution
2.0 → highly random, often incoherent
The practical breakdown: 0.0–0.3 for code generation, data extraction, classification — you want the same answer every time. 0.7–1.0 for creative writing or brainstorming — you want variety.
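Under the hood, temperature divides the model's raw scores (logits) before they're turned into probabilities. Here's a minimal sketch of that step; the logits are made up, and treating temperature 0 as "pick the top token" is how I've modeled the greedy case, since you can't literally divide by zero.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Greedy decoding: all probability on the highest-scoring token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures pile probability onto the top token; higher ones spread it out, which is where the variety (and eventually the incoherence) comes from.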
Top-p (Nucleus Sampling)
An alternative randomness control: restrict sampling to the smallest set of tokens whose cumulative probability exceeds p. top_p: 0.9 means "only consider tokens accounting for 90% of the probability mass."
Most providers accept both temperature and top-p, but the usual recommendation is to adjust only one of them at a time. Pick one.
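The nucleus cut-off is easy to sketch. This toy version works on a made-up probability table; it keeps tokens in descending order until the running total reaches p, then renormalizes what's left. Real samplers do the same thing over the whole vocabulary.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest high-probability set whose mass reaches p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token distribution
next_token = {"the": 0.5, "a": 0.3, "an": 0.15, "xyz": 0.05}
print(top_p_filter(next_token, p=0.9))
```

With p=0.9 the long-tail token "xyz" is dropped before sampling ever happens, which is the whole point: it trims unlikely tokens without flattening or sharpening the rest.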
System Prompt
The instruction block that precedes the conversation. It sets the model's persona, constraints, output format, and behavioral guardrails. If you get one thing right in an LLM integration, make it the system prompt.
System: You are a senior TypeScript engineer. Respond with typed code only.
Never apologize. If you don't know something, say "I don't know."
The system prompt is what separates a random chatbot from a product that behaves consistently.
Training Concepts
Pre-training
The foundational phase where a base model learns language by predicting the next token across a very large text corpus (hundreds of billions to trillions of tokens). The result is a model that can complete text coherently but isn't yet useful as an assistant. Think of it as learning to speak, not yet understanding how to be helpful.
Fine-tuning
Continued training on a curated, task-specific dataset to adapt a pre-trained model for a particular use case. Much cheaper than pre-training, but requires quality labeled data. Useful when you need a specific tone, output format, or domain vocabulary.
RLHF (Reinforcement Learning from Human Feedback)
The technique that transforms raw pre-trained models into helpful assistants. Human raters compare model outputs, those preferences train a reward model, and that reward model guides further training.
GPT-3.5, Claude, Gemini — all pre-trained models that went through RLHF (or a variant) to become the chat assistants you use. It's what turns "language completer" into "helpful AI."
DPO (Direct Preference Optimization)
A newer, simpler alternative to RLHF that achieves similar alignment without a separate reward model. It also tends to be more stable to train. Widely used in open-weight models (Llama 3, Mistral) because of its efficiency.
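For the curious, the DPO objective for a single preference pair fits in a few lines. This sketch assumes you already have the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the model being trained and a frozen reference model; the numbers in the example are invented, and real training averages this loss over batches inside an autodiff framework.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: loss sits at log(2)
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted toward the chosen response: loss drops
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The preference signal goes straight into the loss, so there is no reward model to train and no reinforcement learning loop to stabilize.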
Architecture Terms
Transformer
The neural network architecture underlying virtually every modern LLM, introduced in the 2017 paper "Attention Is All You Need." The core innovation is self-attention, which lets each token in the context attend directly to other tokens in parallel (in decoder-style LLMs, to all the tokens that come before it).
Before transformers, models processed sequences left-to-right and couldn't easily capture long-range dependencies. Transformers eliminated that constraint.
Self-Attention
The mechanism by which a transformer weighs the relevance of each token relative to every other token when generating the next one. When you write "the bank by the river," self-attention lets "bank" attend strongly to "river," which helps the model read it as a riverbank rather than a financial institution.
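Here's a stripped-down sketch of scaled dot-product attention for a single query token, in plain Python. The 2-dimensional vectors are hand-picked so that "river" lines up with "bank"; a real model learns these vectors, uses separate query/key/value projections, and runs many attention heads in parallel.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query: list[float], keys: list[list[float]],
           values: list[list[float]]) -> tuple[list[float], list[float]]:
    """Return (attention weights, weighted sum of values) for one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dims = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dims)]
    return weights, output

tokens = ["the", "bank", "by", "the", "river"]
# Made-up key vectors; values reuse the keys to keep the example small
keys = [[0.1, 0.0], [0.5, 0.5], [0.0, 0.1], [0.1, 0.0], [1.0, 1.0]]
bank_query = [1.0, 1.0]

weights, _ = attend(bank_query, keys, values=keys)
for token, w in zip(tokens, weights):
    print(f"{token:>6}: {w:.3f}")
```

The weights sum to 1, and the largest one lands on "river", which is the toy version of the disambiguation described above.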
Parameters
The learnable numerical weights inside the model. Parameter count (7B, 70B, 405B) is a rough proxy for capability and inference cost.
GPT-2: 1.5B parameters (2019, largest variant)
GPT-3: 175B parameters (2020)
Llama 3.1: 405B parameters (2024, largest variant)
More parameters ≠ always better for your use case, especially when smaller models run faster and cost less.
Quantization
Reducing the precision of model weights (e.g., from 16- or 32-bit floats to 4-bit integers) to cut memory requirements and speed up inference, usually at a modest cost in quality. This is what makes it possible to run large models on consumer hardware: WebLLM in the browser, for example, uses 4-bit quantized models.
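The memory savings are simple arithmetic: parameters times bits per weight. This sketch only counts the weights themselves; it ignores activations, the KV cache, and the small overhead quantization formats add for scaling factors, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_parameters: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Going from 32-bit to 4-bit shrinks a 7B model's weights from about 28 GB to about 3.5 GB, which is the difference between needing a datacenter GPU and fitting on a laptop.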
Mixture of Experts (MoE)
An architecture where the model contains many "expert" sub-networks but only activates a subset per token. This allows very large total parameter counts with far fewer active parameters per inference. Mixtral and Llama 4 use variants of this architecture, and GPT-4 is widely reported, though not officially confirmed, to do the same.
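The routing idea can be sketched with scalar "experts". This is a toy under heavy assumptions: real experts are feed-forward networks, the router logits come from a learned layer, and production systems add load-balancing tricks I've left out. The top-k-then-softmax gating shown here is one common variant.

```python
import math
from typing import Callable

def route_top_k(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax their scores into gates."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x: float, experts: list[Callable[[float], float]],
                router_logits: list[float], k: int = 2) -> float:
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    return sum(gate * experts[i](x) for i, gate in gates.items())

# Four toy experts; only two of them run for this token
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: 4 * x]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores
print(moe_forward(3.0, experts, router_logits, k=2))
```

Experts 0 and 2 never execute for this token, which is where the compute savings come from: total parameters stay large, active parameters stay small.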
Application Patterns
RAG (Retrieval-Augmented Generation)
Augmenting a model's knowledge by retrieving relevant documents from an external database at inference time and injecting them into the prompt. The model reasons over your data without being fine-tuned on it.
User query → vector search → top-k relevant chunks → LLM prompt → answer
Use RAG when: your data is dynamic, large, or needs citations. Use fine-tuning when: you need consistent tone, specialized vocabulary, or a specific output structure.
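Here's the retrieve-then-prompt flow as a self-contained sketch. To keep it runnable without any services, word counts stand in for a real embedding model and a sorted list stands in for a vector database; the documents and the prompt wording are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. NOT a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Inject the retrieved chunks into the prompt as numbered sources."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query, chunks, k)))
    return ("Answer using only the sources below and cite them by number.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {query}")

docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
print(build_prompt("How many days do refunds take", docs))
```

Swap `embed` for a real embedding call and `retrieve` for a vector database query and the shape of the pipeline stays the same.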
Hallucination
When a model generates confident-sounding but factually incorrect content. It's not lying — the model is optimized to produce plausible-sounding tokens, not verified facts. The confidence is the dangerous part.
Mitigation: RAG with cited sources, structured output with validation, chain-of-thought before answering, explicit uncertainty ("I'm not certain, but...").
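Of those mitigations, "structured output with validation" is the easiest to automate. A minimal sketch, assuming you've asked the model to reply in JSON: parse the reply, check the fields you expect, and reject anything else. The field names are hypothetical, and a schema library would do this more thoroughly.

```python
import json

def parse_validated(raw: str, required: dict[str, type]) -> dict:
    """Parse model output as JSON and check required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"field {key!r} should be {expected_type.__name__}")
    return data

schema = {"answer": str, "source_ids": list}  # hypothetical response shape
reply = '{"answer": "14 days", "source_ids": [1]}'
print(parse_validated(reply, schema))
```

Validation can't tell you whether the answer is true, but it does catch malformed or incomplete replies before they reach your users, and a failed check is a natural trigger for a retry.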
Few-shot Learning
Providing 2–5 examples of the desired input/output pattern inside the prompt itself. No weight updates — the model generalizes from the examples in context.
Input: "Paris" → Output: "France"
Input: "Tokyo" → Output: "Japan"
Input: "Berlin" → Output: ???
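In code, few-shot prompting is just string assembly. This sketch builds the same city-to-country prompt from a list of example pairs; the function name and the exact formatting are my own choices, not a required convention.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Format example pairs, then leave the final output blank for the model."""
    lines = [f'Input: "{inp}" → Output: "{out}"' for inp, out in examples]
    lines.append(f'Input: "{query}" → Output:')
    return "\n".join(lines)

examples = [("Paris", "France"), ("Tokyo", "Japan")]
print(build_few_shot_prompt(examples, "Berlin"))
```

Keeping examples in a list makes it easy to swap them out or pick the most relevant ones per request.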
Chain-of-Thought (CoT)
Prompting the model to reason step-by-step before producing a final answer. This often improves accuracy on multi-step reasoning tasks. The simple version: add "Think step by step before answering" to your prompt.
Tool Use / Function Calling
The model decides to call an external function rather than generating a text answer. It emits a structured JSON payload; your code executes it and returns results. This is what makes AI agents possible.
{
  "tool": "get_stock_price",
  "parameters": { "symbol": "NVDA" }
}
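On your side of the wire, handling that payload looks roughly like this. It's a sketch that assumes the exact payload shape shown above (each provider wraps tool calls in its own format), and `get_stock_price` is a stub I made up that returns a placeholder instead of hitting a real market data service.

```python
import json

def get_stock_price(symbol: str) -> dict:
    """Hypothetical stub: a real version would query a market data service."""
    return {"symbol": symbol, "price": None}

# Registry of functions the model is allowed to call
TOOLS = {"get_stock_price": get_stock_price}

def dispatch(payload_json: str) -> dict:
    """Parse the model's tool call, run the matching function, return the result."""
    call = json.loads(payload_json)
    name = call.get("tool")
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**call.get("parameters", {}))

payload = '{"tool": "get_stock_price", "parameters": {"symbol": "NVDA"}}'
print(dispatch(payload))
```

The allow-list matters: the model only suggests a call, and your code decides whether to run it. The result then goes back to the model as another message so it can write the final answer.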
MCP (Model Context Protocol)
An open protocol (Anthropic, 2024) that standardizes how AI models connect to tools and data sources. Think of it as USB-C for AI integrations — one standard interface instead of building bespoke integrations for every tool and every model.
Prompt Caching
A feature on some providers that reduces the cost of re-sending static content (like a large system prompt) on every API call. The first call writes the cache; subsequent calls read from it at a steep discount (up to roughly 90% off the input price on some providers; the exact figure, cache lifetime, and any cache-write surcharge vary). If you have a large, static system prompt, this is often your biggest cost optimization.
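To get a feel for the savings, here's a rough calculator. Every number in the example is hypothetical, and the model is simplified: it assumes the cache never expires and that writing the cache costs the normal input price, neither of which is guaranteed by any given provider.

```python
def input_cost_usd(calls: int, static_tokens: int, dynamic_tokens: int,
                   price_per_mtok: float, cache_read_discount: float = 0.0) -> float:
    """Total input cost across `calls` requests.

    cache_read_discount=0.0 models no caching; 0.9 means cached reads
    cost 10% of the normal input price.
    """
    if calls <= 0:
        return 0.0
    first_call_static = static_tokens  # first call pays full price (cache write)
    later_static = (calls - 1) * static_tokens * (1.0 - cache_read_discount)
    dynamic = calls * dynamic_tokens   # the changing part is never cached
    billable = first_call_static + later_static + dynamic
    return billable * price_per_mtok / 1_000_000

without = input_cost_usd(1_000, 10_000, 200, price_per_mtok=3.00)
with_cache = input_cost_usd(1_000, 10_000, 200, price_per_mtok=3.00,
                            cache_read_discount=0.9)
print(f"no caching: ${without:.2f}, with caching: ${with_cache:.2f}")
```

Under these assumptions, 1,000 calls with a 10,000-token system prompt drop from about $30.60 to about $3.63 in input cost. The bigger the static prefix relative to the dynamic part, the bigger the win.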
Quick Reference
| Term | One-liner |
|---|---|
| Token | ~4 chars; the billing and context unit |
| Context window | How much text the model can see at once |
| Temperature | Randomness control — 0 for deterministic |
| Embedding | Text → vector; powers semantic search |
| RLHF | How pre-trained models become helpful |
| RAG | Inject retrieved, up-to-date data into prompts at query time |
| Fine-tuning | Continue training on your data for specialization |
| Hallucination | Confident but wrong output |
| Few-shot | Give examples inside the prompt |
| CoT | Ask the model to reason before answering |
| Tool use | Model requests a call to your code instead of only returning text |
| MCP | Standard protocol for AI ↔ tool connectivity |
| Prompt caching | Large discount (up to ~90% on some providers) on repeated static context |
Understanding this vocabulary is the difference between reading the docs and actually knowing what you're building. Most of these concepts interact — a RAG pipeline uses embeddings, vector search, context windows, prompt engineering, and prompt caching simultaneously. Get comfortable with the terms and the architecture decisions start making sense.