AI Terminology for Developers: A Practical Glossary

TLDR: Tokens, context windows, embeddings, RLHF, RAG, temperature — every term you keep Googling explained once, clearly, for engineers who are building with this stuff. The quick reference table at the end is the part I actually go back to.

Here's what happened when I started working with LLM APIs: I spent the first week Googling "what is a token" every time I saw it in pricing docs. Then I spent another week not quite understanding why my chatbot was getting expensive on long conversations. Then I learned what "context window" actually meant, and it all clicked at once.

This glossary is the reference I wish I'd had. No PhD required — just the vocabulary you actually need, explained for engineers who are building things.

The Fundamentals

Token

The atomic unit of text a language model processes. Not words, not characters — tokens. A token is roughly 3–4 characters of English text, but this varies by language and tokenizer. The word developer is one token; unambiguous might be two.

Why this matters to you: Every API call costs tokens — both input and output. When you see "128k context window," that's roughly 96,000 words. When you see "$3.00 per million input tokens," you're paying for tokens, not words.

"Hello, world!" → ["Hello", ",", " world", "!"]   // 4 tokens

Rule of thumb: divide your character count by ~4 to estimate tokens for English text.
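
The rule of thumb above fits in a few lines. This is a rough estimator only, assuming English text at about 4 characters per token; `estimate_tokens` and `estimate_input_cost` are hypothetical helper names, and the $3.00-per-million price is just the example figure from earlier in this entry. For billing-accurate counts, use your provider's own tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # rule-of-thumb assumption for English text


def estimate_tokens(text: str) -> int:
    """Rough token estimate: character count divided by ~4, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_input_cost(text: str, usd_per_million_tokens: float = 3.00) -> float:
    """Approximate cost in USD of sending `text` once as input."""
    return estimate_tokens(text) * usd_per_million_tokens / 1_000_000


print(estimate_tokens("Hello, world!"))  # 13 characters -> 4
```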

Context Window

The maximum amount of text (in tokens) the model can see at one time — including system prompt, conversation history, any documents you've injected, and the model's own output.

The analogy I use: think of it as RAM, not disk. Data outside the context window doesn't exist to the model. It's not stored somewhere — it's just gone.

Model	Context Window
GPT-4o	128k tokens
Claude Sonnet 4.6	200k tokens
Gemini 1.5 Pro	1M tokens

This is also why long conversations get expensive — every message you send includes the entire conversation history so far. Turn 20 of a chat sends 19 prior exchanges plus the current message, every single time.
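
To see how quickly that adds up, here is a back-of-the-envelope sketch. It assumes every user message and every reply has a fixed size and ignores the system prompt; the function names and the 50/150 token sizes are made up for illustration.

```python
def input_tokens_at_turn(turn: int, user_tokens: int, reply_tokens: int) -> int:
    """Tokens sent on a given turn: all prior exchanges plus the new user message."""
    prior_exchanges = turn - 1
    return prior_exchanges * (user_tokens + reply_tokens) + user_tokens


def total_input_tokens(turns: int, user_tokens: int, reply_tokens: int) -> int:
    """Cumulative input tokens billed across the whole conversation."""
    return sum(
        input_tokens_at_turn(t, user_tokens, reply_tokens)
        for t in range(1, turns + 1)
    )


# Assumed sizes: 50-token user messages, 150-token replies, no system prompt.
print(input_tokens_at_turn(20, 50, 150))  # 19 * 200 + 50 = 3850
print(total_input_tokens(20, 50, 150))    # 39000
```

Per-turn input grows linearly, so the cumulative bill grows roughly with the square of the conversation length. That is the reason history trimming and summarization pay off.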

Embedding

A numerical vector (array of floats) that represents the semantic meaning of text. Similar text produces vectors that are geometrically close in high-dimensional space. Semantically different text produces vectors that are far apart.

# The classic example:
embed("king") - embed("man") + embed("woman") ≈ embed("queen")

Why this matters: Embeddings are what power semantic search, recommendation systems, and RAG. When you search a vector database, you're finding vectors (and the text they represent) that are geometrically close to your query's embedding.
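
"Geometrically close" usually means high cosine similarity. A minimal sketch, using hand-made 3-dimensional toy vectors in place of a real embedding model (real embeddings have hundreds or thousands of dimensions, and the `TOY_EMBEDDINGS` values here are invented purely for illustration):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors; a real system would get these from an embedding model.
TOY_EMBEDDINGS = {
    "reset my password": [0.9, 0.1, 0.0],
    "forgot login credentials": [0.8, 0.2, 0.1],
    "quarterly revenue report": [0.0, 0.1, 0.9],
}


def nearest(query_vec: list[float], store: dict[str, list[float]]) -> str:
    """Return the stored text whose vector is most similar to the query vector."""
    return max(store, key=lambda text: cosine_similarity(query_vec, store[text]))


print(nearest([1.0, 0.0, 0.0], TOY_EMBEDDINGS))
```

A vector database does the same comparison, just with approximate indexes so it stays fast across millions of vectors.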

Inference

Running a trained model to generate output. This is what you pay for when using an API. Contrast with training, which is when the model learns — that happens before you ever touch it.

Cost breakdown for inference:

Input tokens: Everything you send in
Output tokens: What the model generates (usually 3–5× more expensive per token)
Cache read tokens: Much cheaper — only on providers that support prompt caching
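
The breakdown above turns into a simple per-request calculation. The prices below are illustrative assumptions, not any provider's actual rates: $3.00 per million input tokens is the example figure used earlier, output is assumed at 5× that, and cache reads at 10% of the input price. `Pricing` and `request_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """USD per million tokens. All values here are illustrative assumptions."""
    input: float = 3.00       # example figure used earlier in this glossary
    output: float = 15.00     # assumed 5x the input price
    cache_read: float = 0.30  # assumed 10% of the input price


def request_cost(input_tokens: int, output_tokens: int,
                 cache_read_tokens: int = 0,
                 pricing: Pricing = Pricing()) -> float:
    """Cost in USD of a single request, summed across the three token types."""
    return (
        input_tokens * pricing.input
        + output_tokens * pricing.output
        + cache_read_tokens * pricing.cache_read
    ) / 1_000_000


# 2,000 input tokens and a 500-token reply under the assumed prices.
print(f"${request_cost(2_000, 500):.4f}")
```

Swap in the real numbers from your provider's pricing page before relying on the result.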

Model Behavior Controls

Temperature

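A sampling parameter controlling output randomness. The accepted range depends on the provider: commonly 0.0–2.0, though some APIs cap it at 1.0.

0.0  →  always picks the highest-probability next token (effectively deterministic, though some APIs still show minor run-to-run variation)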
1.0  →  samples proportionally from the probability distribution
2.0  →  highly random, often incoherent

The practical breakdown: 0.0–0.3 for code generation, data extraction, classification — you want the same answer every time. 0.7–1.0 for creative writing or brainstorming — you want variety.
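
Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch of that step, with made-up logits for three candidate tokens; `apply_temperature` is a hypothetical name, and treating 0.0 as plain argmax is the usual convention rather than a guarantee about any particular API.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw logits into sampling probabilities at a given temperature."""
    if temperature == 0.0:
        # Greedy decoding: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.0, 0.3, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

Low temperatures sharpen the distribution toward the top token; high temperatures flatten it, which is where the incoherence at 2.0 comes from.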

Top-p (Nucleus Sampling)

Instead of temperature, restrict sampling to the smallest set of tokens whose cumulative probability exceeds p. top_p: 0.9 means "only consider tokens accounting for 90% of the probability mass."

Most providers accept both temperature and top-p in the same request, but they interact, and the usual recommendation is to adjust only one of them. Pick one.
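
Here is a sketch of the nucleus cut itself, applied to a made-up next-token distribution. `top_p_filter` is a hypothetical name; real implementations work on tensors of logits, but the logic is the same: sort, accumulate, cut, renormalize.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1 again.
    return {token: prob / total for token, prob in kept.items()}


# Made-up next-token distribution.
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
print(top_p_filter(probs, 0.9))  # "zebra" falls outside the nucleus
```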

System Prompt

The instruction block that precedes the conversation. It sets the model's persona, constraints, output format, and behavioral guardrails. If you get one thing right in an LLM integration, make it the system prompt.

System: You are a senior TypeScript engineer. Respond with typed code only.
        Never apologize. If you don't know something, say "I don't know."

The system prompt is what separates a random chatbot from a product that behaves consistently.

Training Concepts

Pre-training

The foundational phase where a base model learns language by predicting the next token across hundreds of billions to trillions of tokens of text. The result is a model that can complete text coherently, but one that isn't yet useful as an assistant. Think of it as learning to speak, not yet understanding how to be helpful.

Fine-tuning

Continued training on a curated, task-specific dataset to adapt a pre-trained model for a particular use case. Much cheaper than pre-training, but requires quality labeled data. Useful when you need a specific tone, output format, or domain vocabulary.

RLHF (Reinforcement Learning from Human Feedback)

The technique that transforms raw pre-trained models into helpful assistants. Human raters compare model outputs, those preferences train a reward model, and that reward model guides further training.

GPT-3.5, Claude, Gemini — all pre-trained models that went through RLHF (or a variant) to become the chat assistants you use. It's what turns "language completer" into "helpful AI."

DPO (Direct Preference Optimization)

A newer, simpler alternative to RLHF that achieves similar alignment without a separate reward model. More stable to train. Widely used in open-source models (LLaMA 3, Mistral) because of its efficiency.
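
To make "without a separate reward model" concrete, here is a sketch of the per-example DPO loss as it is usually written. The inputs are the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the model being trained and under a frozen reference copy. The function name and the `beta=0.1` default are my assumptions, and a real training loop would compute this on batched tensors.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z)).
    return math.log1p(math.exp(-z))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted toward the chosen response: loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The human preference pairs feed the loss directly, so the reward model and the reinforcement learning loop from RLHF both disappear.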

Architecture Terms

Transformer

The neural network architecture underlying virtually every modern LLM. Published in the 2017 paper "Attention Is All You Need." The core innovation: self-attention, which lets every token in the context directly attend to every other token simultaneously.

Before transformers, recurrent models (RNNs, LSTMs) processed sequences one token at a time, which made long-range dependencies hard to capture and training hard to parallelize. Transformers removed that constraint.

Self-Attention

The mechanism by which a transformer weighs the relevance of every other token in the context when building its representation of each token. In "the bank by the river," self-attention lets "bank" attend strongly to "river," which steers the model toward the riverbank sense of the word rather than the financial one.
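
A single attention head is small enough to write out by hand. This is a sketch of scaled dot-product attention in plain Python, with two big simplifications: the same made-up toy vectors serve as queries, keys, and values (a real model derives each from learned projections), and there is no causal mask. All names and numbers here are illustrative.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, for one head."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance score of every token to this one.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["bank", "by", "river"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # made-up toy vectors
_, weights = self_attention(vectors, vectors, vectors)
print(dict(zip(tokens, (round(w, 3) for w in weights[0]))))  # weights for "bank"
```

In this toy setup "bank" puts more weight on "river" than on "by" only because I gave those two tokens matching vectors. In a trained model, that alignment is what the learned projections produce.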

Parameters

The learnable numerical weights inside the model. Parameter count (7B, 70B, 405B) is a rough proxy for capability and inference cost.

GPT-2:      1.5B parameters  (2019)
GPT-3:      175B parameters  (2020)
Llama 3.1:  405B parameters  (2024)

More parameters ≠ always better for your use case, especially when smaller models run faster and cost less.

Quantization

Reducing the precision of model weights (e.g., from 32-bit floats to 4-bit integers) to cut memory requirements and speed up inference with minimal quality loss. This is what makes it possible to run large models on consumer hardware — WebLLM in the browser uses 4-bit quantized models.
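
The memory savings are straightforward arithmetic: parameters times bits per weight. A rough sketch that counts the weights only and ignores activations, the KV cache, and quantization overhead such as scale factors; `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_parameters: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions a 7B model drops from about 28 GB at 32-bit to about 3.5 GB at 4-bit, which is the difference between needing a datacenter GPU and fitting on a laptop.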

Mixture of Experts (MoE)

An architecture where the model contains many "expert" sub-networks but only activates a subset per token. Allows very large total parameter counts with much lower active parameters per inference. Mixtral and Llama 4 use variants of this architecture, and GPT-4 is widely reported to as well, though OpenAI has not confirmed it.
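
The routing step is the whole trick, so here is a toy sketch of one common formulation: score every expert, keep the top k, and softmax-normalize the scores of the ones kept. The "experts" are plain Python functions standing in for feed-forward sub-networks, and the router scores are made up; real routers are small learned layers.

```python
import math


def route_top_k(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, experts, router_logits: list[float], k: int = 2) -> float:
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    gates = route_top_k(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Four toy "experts" standing in for feed-forward sub-networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token
print(route_top_k(router_logits, k=2))  # only experts 1 and 3 are active
print(moe_forward(3.0, experts, router_logits))
```

With k=2 out of 4 experts, only half of the expert parameters do any work for this token, which is where the gap between total and active parameter counts comes from.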

Application Patterns

RAG (Retrieval-Augmented Generation)

Augmenting a model's knowledge by retrieving relevant documents from an external database at inference time and injecting them into the prompt. The model reasons over your data without being fine-tuned on it.

User query → vector search → top-k relevant chunks → LLM prompt → answer

Use RAG when: your data is dynamic, large, or needs citations. Use fine-tuning when: you need consistent tone, specialized vocabulary, or a specific output structure.
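
The pipeline above, end to end, in miniature. Loud caveats: `embed` here is a toy bag-of-words stand-in for a real embedding model, the "vector search" is a linear scan over three invented chunks, and the final LLM call is left out entirely. Every name is hypothetical. The retrieve-then-inject shape is the part to pay attention to.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Inject the retrieved chunks into the prompt as numbered sources."""
    sources = "\n".join(
        f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query, chunks, k))
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


CHUNKS = [  # invented example documents
    "Refunds are issued within 14 days of purchase",
    "Our office is closed on public holidays",
    "Shipping to Europe takes 5 to 7 business days",
]
print(build_prompt("how long do refunds take", CHUNKS, k=1))
```

Numbering the sources in the prompt is what makes citations possible: the model can point at [1], and your UI can link it back to the original document.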

Hallucination

When a model generates confident-sounding but factually incorrect content. It's not lying — the model is optimized to produce plausible-sounding tokens, not verified facts. The confidence is the dangerous part.

Mitigation: RAG with cited sources, structured output with validation, chain-of-thought before answering, explicit uncertainty ("I'm not certain, but...").

Few-shot Learning

Providing 2–5 examples of the desired input/output pattern inside the prompt itself. No weight updates — the model generalizes from the examples in context.

Input: "Paris" → Output: "France"
Input: "Tokyo" → Output: "Japan"
Input: "Berlin" → Output: ???
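
In code, few-shot prompting is just string assembly. A minimal sketch that rebuilds the capital-to-country example above; `build_few_shot_prompt` and the exact line format are my own choices, not a required convention.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Lay out solved examples, then the new input with the output left open."""
    lines = [f'Input: "{inp}" -> Output: "{out}"' for inp, out in examples]
    lines.append(f'Input: "{query}" -> Output:')
    return "\n".join(lines)


examples = [("Paris", "France"), ("Tokyo", "Japan")]
print(build_few_shot_prompt(examples, "Berlin"))
```

Ending the prompt on the open "Output:" slot nudges the model to continue the pattern instead of commenting on it.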

Chain-of-Thought (CoT)

Prompting the model to reason step-by-step before producing a final answer. Significantly improves accuracy on multi-step reasoning tasks. The simple version: add "Think step by step before answering" to your prompt.

Tool Use / Function Calling

The model decides to call an external function rather than generating a text answer. It emits a structured JSON payload naming the function and its arguments; your code executes that function and returns the result to the model. This is what makes AI agents possible.

{
  "tool": "get_stock_price",
  "parameters": { "symbol": "NVDA" }
}
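
On your side of the wire, handling that payload is a lookup and a function call. A minimal dispatcher sketch, assuming the simplified `{"tool": ..., "parameters": ...}` shape shown above (each provider wraps tool calls in its own envelope, so check its docs). `get_stock_price` is a stub that returns a placeholder, not a real quote.

```python
import json


def get_stock_price(symbol: str) -> dict:
    """Hypothetical stub; a real implementation would call a market-data API."""
    return {"symbol": symbol, "price": 0.0}  # placeholder, not a real quote


# Registry of functions the model is allowed to call.
TOOLS = {"get_stock_price": get_stock_price}


def dispatch_tool_call(payload_json: str) -> str:
    """Run the tool the model asked for and return a JSON result to send back."""
    call = json.loads(payload_json)
    tool = TOOLS.get(call.get("tool"))
    if tool is None:
        return json.dumps({"error": f"unknown tool: {call.get('tool')}"})
    try:
        result = tool(**call.get("parameters", {}))
    except TypeError as exc:  # wrong or missing arguments from the model
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


payload = '{"tool": "get_stock_price", "parameters": {"symbol": "NVDA"}}'
print(dispatch_tool_call(payload))
```

The explicit registry matters: the model only ever picks a name, and your code decides what that name is allowed to run.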

MCP (Model Context Protocol)

An open protocol (Anthropic, 2024) that standardizes how AI models connect to tools and data sources. Think of it as USB-C for AI integrations — one standard interface instead of building bespoke integrations for every tool and every model.

Prompt Caching

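A feature on some providers that reduces the cost of re-sending static content (like a large system prompt) on every API call. The first call writes the cache; subsequent calls read from it at a steep discount, up to about 90% off the normal input price depending on the provider. Cache entries expire after a period of inactivity, and some providers charge a premium for the initial write. If you have a large, static system prompt, this is often your biggest cost optimization.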
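
A quick way to size the savings before you turn it on. This sketch assumes a 90% discount on cache reads, bills the first call at the normal input price, and ignores both cache expiry and any cache-write premium your provider may charge. The function name and defaults are hypothetical.

```python
def static_prompt_cost(calls: int, static_tokens: int,
                       usd_per_million: float = 3.00,
                       cache_read_discount: float = 0.90,
                       cached: bool = True) -> float:
    """USD spent on the static part of the prompt across `calls` requests."""
    full_price = static_tokens * usd_per_million / 1_000_000
    if not cached or calls == 0:
        return calls * full_price
    # First call pays full price (cache write); later calls pay the read price.
    return full_price + (calls - 1) * full_price * (1 - cache_read_discount)


# 10,000-token system prompt sent 1,000 times, under the assumed prices.
print(f"uncached: ${static_prompt_cost(1_000, 10_000, cached=False):.2f}")
print(f"cached:   ${static_prompt_cost(1_000, 10_000, cached=True):.2f}")
```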

Quick Reference

Term	One-liner
Token	~4 chars; the billing and context unit
Context window	How much text the model can see at once
Temperature	Randomness control — 0 for deterministic
Embedding	Text → vector; powers semantic search
RLHF	How pre-trained models become helpful
RAG	Inject real-time data into prompts via retrieval
Fine-tuning	Continue training on your data for specialization
Hallucination	Confident but wrong output
Few-shot	Give examples inside the prompt
CoT	Ask the model to reason before answering
Tool use	Model calls your code, not just text
MCP	Standard protocol for AI ↔ tool connectivity
Prompt caching	Up to ~90% cost reduction on repeated static context

Understanding this vocabulary is the difference between reading the docs and actually knowing what you're building. Most of these concepts interact — a RAG pipeline uses embeddings, vector search, context windows, prompt engineering, and prompt caching simultaneously. Get comfortable with the terms and the architecture decisions start making sense.

