What exactly is Generative AI?
Generative AI refers to artificial intelligence systems capable of creating new content — text, images, code, audio, or video — rather than simply classifying or predicting from existing data. The leap from "AI that recognizes cats" to "AI that writes code" is generative AI.
The current wave is powered by Large Language Models (LLMs) — models like GPT-4, Claude, and Gemini — trained on hundreds of billions of words scraped from the internet, books, and code repositories. These models don't "know" things the way you do. They predict what word (or token) is most likely to come next, given everything they've seen before.
The 7 concepts every developer must understand
LLM
Large Language Model. A neural network trained on massive text data to generate human-like language.
Token
The atomic unit of text for an LLM. A word, sub-word, or character. "ChatGPT" ≈ 3 tokens.
Prompt
The input you give the model. Prompt engineering is the craft of writing better inputs for better outputs.
Temperature
Controls randomness. Low temp = predictable. High temp = creative (but riskier). Range: 0.0 – 2.0.
Hallucination
When an LLM confidently generates false information. It's a structural risk, not a bug that can be fully patched.
RAG
Retrieval-Augmented Generation. Fetch real documents, feed them to the model. Dramatically reduces hallucination.
Embedding
A numerical vector representing the meaning of text. Powers semantic search and RAG pipelines.
Context window
The maximum tokens an LLM can "see" at once. Think of it as working memory. GPT-4 Turbo: 128k tokens.
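To make the "embedding" entry concrete: semantic search works by comparing vectors with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — invented numbers, just to show the comparison
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.2, 0.05]
invoice = [0.0, 0.1, 0.95]

print(cosine_similarity(cat, kitten))   # high score: related meanings
print(cosine_similarity(cat, invoice))  # low score: unrelated meanings
```

Texts with similar meanings end up with similar vectors, so "nearest vector" becomes "most semantically relevant document" — the core trick behind RAG pipelines.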
How does an LLM actually work?
At the core of every modern LLM is the Transformer architecture, introduced by Google in the landmark 2017 paper "Attention Is All You Need." The key mechanism is self-attention — the model learns which words in a sentence are most relevant to each other, no matter how far apart they are.
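Self-attention can be sketched in a few lines. This toy version skips the learned query/key/value projections a real Transformer has — each token vector plays all three roles — but the mechanism (score every pair of tokens, softmax the scores, mix the vectors by those weights) is the same idea.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(vectors):
    """Toy scaled dot-product self-attention over a list of token vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        # How relevant is every token (including itself) to this one?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)  # attention weights, sum to 1
        # Output for this position: attention-weighted mix of all tokens
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy 2-d token vectors
print(self_attention(tokens))
```

Note that every token attends to every other token regardless of distance — exactly the property that lets the model connect a pronoun to a name ten words earlier.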
Training happens in two major phases. First, pre-training: the model reads enormous amounts of text and learns to predict the next token, billions of times over, across thousands of GPUs. Second, fine-tuning (often using RLHF — Reinforcement Learning from Human Feedback): the model is steered to give helpful, harmless, and honest responses through human ratings.
Temperature and sampling — the knobs that matter
When an LLM generates a response, it doesn't just pick the single most likely next token. It samples from a probability distribution. The temperature setting controls how "peaked" or "flat" that distribution is:
| Temperature | Behaviour | Best for |
|---|---|---|
| 0.0 | Deterministic, always picks top token | Code generation, factual Q&A |
| 0.3–0.7 | Balanced, slightly creative | General chat, summarisation |
| 1.0–1.5 | More varied, unpredictable | Brainstorming, creative writing |
| 2.0 | Chaotic, often incoherent | Rarely useful |
Hallucination: the root cause, not just the symptom
Hallucination is the most misunderstood failure mode in LLMs. Developers often assume it's a bug that will eventually be fixed — but it's structural. Because LLMs generate text by predicting the next token based on patterns, they have no internal "fact-check" mechanism. They can produce confident, fluent, wrong answers.
The practical fix is RAG (Retrieval-Augmented Generation): you retrieve relevant, verified documents from your own database or the web, and inject them into the prompt. The model then generates its answer grounded in that context, not just its training data. This is the architecture behind most production AI assistants today.
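A minimal RAG sketch, with two loud simplifications: retrieval here is naive keyword overlap (production systems use embeddings and a vector database), and the document strings are invented examples.

```python
def retrieve(query, documents, k=2):
    """Naive keyword-overlap retrieval; real pipelines use embedding search."""
    q_words = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:k]

def build_rag_prompt(query, documents):
    """Inject retrieved documents into the prompt to ground the answer."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}")

docs = [
    "Our refund window is 30 days from purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Shipping to the EU takes 5 to 7 business days.",
]
print(build_rag_prompt("What is the refund window?", docs))
```

The key design choice is the instruction "using only the context below": it pushes the model to ground its answer in verified documents instead of free-associating from training data.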
Prompt engineering: the skill that actually matters right now
For intermediate developers, prompt engineering is the highest-leverage skill in the Gen AI stack. The same model can give wildly different outputs depending on how you phrase your input. Key techniques:
System prompts
Set the model's persona and constraints before the conversation starts.
Few-shot examples
Show the model 2–3 examples of the output format you want.
Chain-of-thought
Ask the model to "think step by step" before answering; this dramatically improves complex reasoning.
Output constraints
Tell the model to respond in JSON, a specific length, or a specific tone.
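The techniques above combine naturally in the chat-message structure most LLM APIs accept. The field names below follow the common OpenAI-style convention; the reviewer persona and JSON schema are invented for illustration.

```python
# A messages list combining a system prompt, a few-shot example,
# and an output constraint (JSON-only responses).
messages = [
    # System prompt: persona and constraints, set before the conversation
    {"role": "system",
     "content": "You are a senior Python reviewer. Respond only in JSON "
                'with keys "issue" and "fix".'},
    # Few-shot example: show the exact output format you want back
    {"role": "user", "content": "def add(a, b): return a - b"},
    {"role": "assistant",
     "content": '{"issue": "subtracts instead of adding", '
                '"fix": "return a + b"}'},
    # The real request; for chain-of-thought, you would also append
    # "think step by step" to the instruction
    {"role": "user", "content": "def is_even(n): return n % 2 == 1"},
]
print(messages[0]["content"])
```

Sending the example exchange alongside the real request costs a few extra tokens but reliably locks in the output format — usually a better trade than post-processing free-form text.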