You type a sentence into ChatGPT, Claude, or Gemini and a paragraph comes back that sounds like a person wrote it. The interesting question is the next one: how do LLMs work, actually, under the hood? The honest answer is that a large language model is a very large statistical engine trained to predict the next chunk of text, one piece at a time. That single trick, scaled up enough, produces code, essays, translations, and arguments. This guide walks through every stage of that pipeline in plain English: tokens, embeddings, attention, training, inference, and the reasons these systems still get things wrong. By the end you will understand what is happening between your prompt and the answer, and you will be able to spot when an LLM is bluffing.
Table of Contents
What Is an LLM, Really
To understand how LLMs work, start from the smallest possible definition. A large language model is a neural network with billions of numeric parameters that has been trained to do one job: given a sequence of text, predict the next piece of text. Everything else, chat, coding, summarising, translating, is a side effect of being very good at that one prediction task.
The model does not store sentences. It stores weights, billions of them, arranged in layers. Your prompt is converted into numbers, those numbers flow through the layers, and the model outputs a probability for every possible next token. It picks one, appends it to the prompt, and runs again. That loop is the entire core of what these systems do.
Three Things an LLM Is Not
It is not a database. It does not look up answers. It approximates patterns from training, which is why it can sound confident about a paper that does not exist.
It is not reasoning the way humans reason. It runs the same forward pass for every token, no matter how hard the question is. Chain-of-thought prompting lets it spend more tokens “thinking out loud”, but the underlying machinery is still next-token prediction.
It is not the model you talked to last week, necessarily. Providers ship new checkpoints constantly. A response from GPT-5.3 Instant in May 2026 is not the same weights as a response from January.
Tokens and Embeddings: How LLMs See Text
Before an LLM can do anything with your prompt, it has to turn your text into numbers. This happens in two steps: tokenisation and embedding. Both are simpler than they sound and both shape how the model behaves more than people realise.
Tokenisation: Slicing Text Into Pieces
The model does not read words. It reads tokens. A token is a chunk of text, usually somewhere between a single character and a short word. The string “tokenisation” might become two tokens, “token” and “isation”. The word “cat” is one token. The phrase “Pudgy Cat” is probably three.
Most modern LLMs use Byte Pair Encoding (BPE). It starts with single characters and merges the most frequent pairs into bigger tokens, building a vocabulary of 50,000 to 200,000 tokens. Common words become single tokens. Rare words get split. This is why an LLM can write your name even if it never saw it during training: it stitches it together from smaller fragments.
Practical consequence: when an API bills you “per token”, one English token is roughly 0.75 words. A 1,000-word document is around 1,300 tokens. Code, JSON, and non-Latin scripts use more tokens per character, which is why a Python file costs more to send than a paragraph of the same byte length.
Embeddings: Turning Tokens Into Vectors
Each token in the vocabulary has a learned vector attached to it, a long list of numbers (often 4,096 dimensions in modern models). That vector is the embedding. It encodes everything the model has learned about how that token tends to behave in language.
Embeddings have a useful property: tokens with similar meaning end up with vectors close together in this high-dimensional space. The embedding for “cat” sits near “kitten” and “feline”, and far from “spreadsheet”. This geometry lets the model generalise. It does not need to have seen the exact phrase you typed, because it can land in a region of the space where similar phrases lived.
The Transformer and Self-Attention
Every modern LLM, from GPT-5 to Claude to Google’s Gemma 4, is built on the same underlying architecture: the transformer. It was introduced in a 2017 paper called “Attention Is All You Need”, and the title is the spoiler. The trick that makes transformers work, and that makes LLMs possible, is a mechanism called self-attention.
What Attention Actually Does
For each token in your prompt, the model needs to figure out which other tokens matter. Take the sentence “The cat sat on the mat because it was tired.” The word “it” refers to the cat, not the mat. A human knows this instantly. The model has to compute it.
Self-attention solves this by computing, for every token, a weighted blend of all the other tokens in the context. For “it”, the model learns to give “cat” a high weight and “mat” a lower one. The weights come from three learned projections of the embeddings called query, key, and value vectors. The intuition: every token asks every other token “are you relevant to me right now” and the most relevant ones get included in the next layer’s input.
Multi-Head Attention and Stacking
Real models do this attention computation many times in parallel, with different learned projections. Each parallel version is an attention head. Different heads pick up different relationships: subject-verb agreement, pronoun resolution, topical relevance over long distances. Modern LLMs run 32 to 128 heads per layer.
The attention output goes through a small feed-forward network, then the whole thing repeats. A modern model stacks 30 to 100 of these blocks. By the time your prompt reaches the top, the model has a deep, context-aware representation of every token.
How LLMs Are Trained
An untrained transformer is useless. It will output gibberish. The magic happens during training, which proceeds in two main phases for modern chat models: pretraining and fine-tuning.
Pretraining: The Big Crawl
Pretraining is where the bulk of “knowledge” enters the model. Engineers assemble a corpus of trillions of tokens, drawn from web crawls, books, code repositories, scientific papers, and forums. The model is asked, over and over, to predict the next token in real text. Every time it gets it wrong, an algorithm called backpropagation nudges its billions of weights very slightly in the direction that would have made the prediction better.
This phase is brutal in compute terms. A frontier model in 2026 burns through tens of thousands of GPU-years and costs hundreds of millions of dollars in electricity and chip time. That is why OpenAI just raised $122 billion and why the AI industry is in an infrastructure arms race.
Fine-Tuning and RLHF
A model that has only done pretraining is a fluent text-completion engine. Ask it a question and it might just continue the question in the style of a forum thread. To turn it into a useful assistant, providers add a second phase.
First comes supervised fine-tuning, where human writers craft thousands of example conversations showing the model how a good assistant responds. Then comes Reinforcement Learning from Human Feedback, or RLHF. Humans rate pairs of model outputs, the model learns a reward signal that approximates human preferences, and the weights get adjusted to maximise that reward. RLHF is why ChatGPT sounds polite, refuses certain requests, and finishes with a summary even when you did not ask.
Not everyone agrees this approach scales to real intelligence. Yann LeCun raised a billion dollars to argue that next-token prediction is a dead end and that world models are the future. The debate is unresolved.
Inference: From Prompt to Reply
Once a model is trained, the weights are frozen and the model is deployed for inference. Inference is what happens every time you send a prompt. The process is mechanical and predictable, and understanding it explains a lot of LLM quirks.
The Generation Loop
Your prompt gets tokenised. The tokens flow through every layer of the model. At the output, the model produces a probability distribution over the entire vocabulary, a number for every one of the 100,000-or-so possible next tokens. The system picks one (more on that below), appends it to the sequence, and runs the whole forward pass again with the new, slightly longer input. This continues until the model generates a special “stop” token or hits a length limit.
Each token requires a full pass through the network. That is why responses stream in word by word and why long answers take longer. It is also why running these models is expensive: each output token costs a full forward pass through tens of billions of parameters.
Temperature and Sampling
The model gives probabilities for every possible next token, but still has to pick one. Greedy decoding always picks the most likely token, which tends to produce repetitive output, so providers use sampling.
Temperature flattens or sharpens the probability distribution. Low temperature near zero is deterministic and conservative. High temperature around one is varied and creative. Top-k and top-p sampling restrict the choice to a shortlist before sampling. This is why running the same prompt twice can produce different answers: randomness is baked into the loop.
Why LLMs Hallucinate
Now that you know how LLMs work, hallucinations stop being mysterious. An LLM is not retrieving facts. It is generating the most statistically plausible continuation of your prompt. If the most plausible continuation happens to be true, you get a correct answer. If the most plausible continuation is a confident-sounding fabrication, you get a hallucination. The model has no internal “I do not know” signal that it consults before answering.
Common Hallucination Failure Modes
Fake citations: when asked for sources, the model generates strings that look like real academic references because that pattern is overwhelmingly common in its training data. The format is correct, the authors and titles are invented.
Confident wrong dates: numbers, especially years, are easy to get wrong because the model only has rough statistical associations, not a calendar.
Phantom features: ask an LLM how to do something in a software tool and it might invent a plausible API call that does not exist, because the pattern “library has method that does this” is strong.
How to Reduce Hallucinations
The most effective fix in production is to stop relying on the model’s own knowledge for facts. Instead, feed it the facts at query time. That is the whole idea behind retrieval-augmented generation. We covered the mechanics in a separate guide on how RAG works: pull the relevant documents from a vector database, paste them into the prompt, ask the model to answer using only that material. Hallucinations drop sharply.
Context Windows, RAG, and Tool Use
A modern LLM does not just sit in isolation. The serious productivity gains in 2026 come from giving it access to context and tools. Three pieces matter here.
The Context Window
Every model has a maximum number of tokens it can process in one pass, called the context window. Early GPT-3 had 2,048 tokens. Modern frontier models advertise 200,000 to 2 million. That is the cap on what the model can see at any moment. Anything older is gone, unless the application stores it externally and replays it.
Retrieval Augmented Generation
Because the context window is finite and pretraining is frozen at some date in the past, almost every serious LLM deployment uses RAG to bring fresh knowledge into the prompt at query time. The question gets embedded into a vector, a search runs against a knowledge base, the top results get inserted into the prompt, and the model answers from that grounded context. This is how enterprise AI handles current data without retraining.
Tool Use and MCP
The other half of the story is letting the model call external tools. A model can be given a list of functions (search the web, run code, query a database) and trained to emit a structured request whenever it needs one. The runtime executes the call, feeds the result back into the context, and the model continues generating. The protocol that standardised this in late 2025 and dominates in 2026 is the Model Context Protocol. We unpacked it in a full MCP explainer. If you want to play with this stack on your own hardware, the guide on running AI locally walks through Ollama, llama.cpp, and getting a model live on a laptop.
FAQ
How do LLMs work in one sentence?
An LLM is a giant neural network that has been trained to predict the next token in a sequence, and “chatting” is just that prediction loop running over and over with your prompt as the seed.
What is the difference between an LLM and AI?
AI is the broader field, including image recognition, robotics, search, and game-playing systems. An LLM is one specific type of AI: a large language model trained on text to generate text. ChatGPT, Claude, and Gemini are LLM-based products.
Do LLMs actually understand language?
Depends what you mean by understand. They build statistical representations of language that capture grammar, common reasoning patterns, and semantic relationships well enough to be genuinely useful. They do not have grounded experience, sensory input, or beliefs about the world. Researchers disagree on whether that gap matters in practice.
Why do LLMs make stuff up?
Because they are optimised to produce plausible text, not true text. When the model has no real information to draw on, generating something plausible is its default behaviour. Hallucinations are not bugs, they are the natural output of the next-token objective applied to a question the model cannot ground.
Can I train my own LLM?
Pretraining a frontier model from scratch is out of reach for individuals, the compute alone runs into nine figures. Fine-tuning a small open-weight model on your own data is cheap and works on consumer GPUs. Running an existing model locally with Ollama or llama.cpp is the easiest entry point.
The Takeaway
Strip away the marketing and an LLM is a token-prediction engine wrapped in attention layers, trained on most of the internet, and steered with human feedback. That is the whole thing. The reason it feels like talking to a thinking system is that next-token prediction, at sufficient scale, captures a stunning amount of how human language works. The reason it sometimes fails badly is that the same machinery has no concept of truth. Knowing the difference is now a basic literacy. The next time a chatbot answers with suspicious confidence, you will know exactly which gear is spinning.
🐾 Visit the Pudgy Cat Shop for prints and cat-approved goodies, or find our illustrated books on Amazon.





Leave a Reply