AI Model Foundations and Terminology

Understand the difference between base, instruct, and fine-tuned LLMs, what makes a model multimodal, and how context windows shape product decisions.

Modern AI planning starts with a shared vocabulary. When engineers, product managers, finance partners, and legal teams align on what “model size,” “context window,” or “fine-tuned” actually mean, roadmap discussions accelerate. Use this primer to ground your team before evaluating vendors or budgeting experimentation.

Transformer architecture in plain language

Nearly every large language model you evaluate builds on the transformer architecture. Transformers read text as a sequence of tokens—short chunks of characters—and process them in parallel through attention layers. Each layer learns relationships between tokens (which tokens refer to which) and updates the model's internal representation. More layers and parameters unlock more nuanced reasoning but also increase inference cost.
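
To make the mechanism concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside each attention layer. It uses toy dimensions and omits the learned projection matrices and multi-head structure of a real transformer:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """One attention step: each token scores every other token,
    then mixes their value vectors according to those scores."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over tokens
    return weights @ V                                   # weighted mix of values

# 4 tokens, 8-dimensional embeddings (toy sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)              # self-attention: Q=K=V
print(out.shape)  # (4, 8): one updated representation per token
```

Stacking dozens of these layers, each with its own learned weights, is what pushes parameter counts into the billions.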

Parameters are the learned weights that store knowledge. GPT-5 is rumored to exceed 2 trillion parameters, while lightweight instruct models can succeed with fewer than 10 billion when fine-tuned on narrow tasks. Treat parameter count as a proxy for potential capability, not a guarantee of quality—training data, alignment, and inference stack design matter just as much.
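
For a rough feel of where those counts come from, the back-of-envelope sketch below uses the common rule of thumb of roughly 12·d_model² weights per transformer layer; real architectures vary, and the model dimensions shown are hypothetical:

```python
def estimate_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Back-of-envelope transformer size: ~12 * d_model^2 weights per layer
    (attention projections + feed-forward block), plus token embeddings."""
    return n_layers * 12 * d_model**2 + vocab_size * d_model

# Hypothetical mid-size model: 32 layers, width 4096, 100K-token vocabulary
print(f"{estimate_params(32, 4096, 100_000):,}")  # 6,852,050,944 (~6.9B)
```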

Tokens, context windows, and why they matter

Tokens are the universal billing unit. A typical English sentence is 15–25 tokens. Providers price input and output tokens separately, so long prompts or verbose responses can inflate cost quickly. The AI Agent Cost Calculator normalizes these token counts so finance teams can model realistic spend.
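
A minimal sketch of the underlying arithmetic, using hypothetical per-million-token prices (check your vendor's current rate card):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Cost of one request given separate input/output token prices."""
    return (input_tokens * usd_per_m_in + output_tokens * usd_per_m_out) / 1e6

# Hypothetical prices: $2.50 per million input tokens, $10 per million output
cost = request_cost(input_tokens=1_200, output_tokens=400,
                    usd_per_m_in=2.50, usd_per_m_out=10.00)
print(f"${cost:.4f} per request")                   # $0.0070
print(f"${cost * 100_000:,.2f} per 100K requests")  # $700.00
```

Note that the 400 output tokens cost more than the 1,200 input tokens here: one reason verbose responses inflate budgets quickly.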

The context window defines how many tokens the model can consider at once. GPT-5 supports 400K+ tokens, while some edge deployments only handle 16K. Choosing a model with insufficient context can break your product if you plan to inject knowledge base passages or sustain multi-step conversations. Context limits also shape memory strategies: you may summarize intermediate steps, chunk documents, or stream context to stay within bounds.
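
As an illustration of the chunking strategy, here is a naive sketch that splits a document by raw token count while reserving room for the model's response. A production system would use the provider's tokenizer and split on semantic boundaries such as paragraphs:

```python
def chunk_tokens(tokens: list[str], context_limit: int,
                 reserve_for_output: int) -> list[list[str]]:
    """Split a token sequence into chunks that fit the usable window."""
    budget = context_limit - reserve_for_output
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]

doc = ["tok"] * 50_000  # stand-in for a 50K-token document
chunks = chunk_tokens(doc, context_limit=16_000, reserve_for_output=1_000)
print(len(chunks), "chunks of at most", 16_000 - 1_000, "tokens")  # 4 chunks
```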

Base, instruct, and fine-tuned checkpoints

Vendors ship multiple checkpoints for each model family (a prompt-shape sketch follows this list):

  • Base models are general-purpose and excel at open-ended synthesis but require careful prompting to stay on task.
  • Instruct models (sometimes called “chat” versions) are aligned to follow directions and are safer for customer-facing workflows.
  • Fine-tuned variants adapt the base model with domain-specific data. They’re ideal when you need compliance, tone control, or specialized knowledge baked in.
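
The practical difference shows up in how you shape the prompt. The sketch below is illustrative: the raw string is what a base checkpoint expects, while the role/content message list follows the chat convention most instruct APIs use (exact field names vary by vendor):

```python
# Base checkpoint: a raw completion prompt; the model simply continues the text.
base_prompt = (
    "Ticket #4821: customer reports login failures after the 2.3 upgrade.\n"
    "One-sentence summary:"
)

# Instruct checkpoint: structured messages with an explicit system role.
instruct_messages = [
    {"role": "system", "content": "You are a support assistant. Be concise."},
    {"role": "user",
     "content": "Summarize ticket #4821: login failures after the 2.3 upgrade."},
]
```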

Fine-tuning isn’t the only path to specialization. Retrieval-augmented generation (RAG) keeps the base model frozen and feeds it curated context each time. Cascading between instruct models and fine-tuned adapters often yields the best balance of accuracy and cost.
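
A minimal RAG sketch, assuming a toy hash-based embedding in place of a real embedding model and an in-memory list in place of a vector database:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

passages = [
    "Refund policy: refunds are issued within 30 days of purchase.",
    "Shipping: standard delivery takes 3-5 business days.",
    "Warranty: manufacturing defects are covered for one year.",
]
index = np.stack([embed(p) for p in passages])  # in-memory "vector store"

def build_rag_prompt(question: str, k: int = 2) -> str:
    scores = index @ embed(question)            # cosine similarity to each passage
    top = [passages[i] for i in np.argsort(scores)[::-1][:k]]
    context = "\n".join(f"- {p}" for p in top)
    return f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"

print(build_rag_prompt("How long do refunds take?"))
```

Because the base model stays frozen, updating the knowledge base is a data change, not a retraining job.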

Latency, throughput, and infrastructure trade-offs

Large models with high parameter counts usually run slower and cost more. Frontier APIs such as GPT-5 Pro balance latency with dedicated capacity, while community models like Llama 4 or Qwen 3.1 can be self-hosted on GPUs you control. When planning, measure:

  1. Cold-start vs warm latency: serverless deployments spike during the first call; persistent clusters reduce jitter but cost more.
  2. Throughput per dollar: track successful responses per $1 spent to compare options apples to apples (see the sketch after this list).
  3. Regional requirements: European or Chinese deployments may require sovereign hosting or specific vendors.
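
Here is one way to capture the first two measurements, sketched with a placeholder call_model function standing in for whatever client your vendor provides; the spend figure is hypothetical:

```python
import time

def call_model() -> bool:
    """Placeholder for your vendor's client; returns True on a usable response."""
    time.sleep(0.05)
    return True

def measure(n: int = 20) -> dict:
    """Time n sequential calls; the first call captures any cold start."""
    latencies, successes = [], 0
    for _ in range(n):
        t0 = time.perf_counter()
        successes += call_model()
        latencies.append(time.perf_counter() - t0)
    warm = sorted(latencies[1:])
    return {"cold_start_s": latencies[0],
            "warm_p50_s": warm[len(warm) // 2],
            "successes": successes}

stats = measure()
total_cost_usd = 0.14  # hypothetical spend for the 20 calls
print(stats["successes"] / total_cost_usd, "successful responses per $1")
```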

Map these characteristics back to user expectations. Support bots can tolerate 2–3 second latency; coding copilots often need sub-second completions.

How to socialize the terminology internally

We recommend introducing a one-page glossary at project kickoffs. Link to this article from your documentation portal, gather questions in advance, and host a 30-minute live session with engineers. When everyone speaks the same language, vendor evaluations, procurement reviews, and compliance audits wrap up faster.

Continue your due diligence by comparing concrete model options in the Model Library and reviewing how to structure vendor cascades in the model selection framework.
