What exactly is Generative AI?

Generative AI refers to artificial intelligence systems capable of creating new content — text, images, code, audio, or video — rather than simply classifying or predicting from existing data. The leap from "AI that recognizes cats" to "AI that writes code" is generative AI.

The current wave is powered by Large Language Models (LLMs) — models like GPT-4, Claude, and Gemini — trained on vast text corpora, hundreds of billions to trillions of tokens drawn from the internet, books, and code repositories. These models don't "know" things the way you do. They predict what word (or token) is most likely to come next, given everything they've seen before.

Key insight: An LLM doesn't retrieve facts from a database. It generates responses token by token, based on statistical patterns learned during training. This is both its superpower and the root of hallucinations.
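
That token-by-token loop can be sketched with a toy next-token table standing in for the neural network. The vocabulary and probabilities below are invented for illustration; a real LLM computes this distribution with billions of learned parameters:

```python
import random

# Toy "model": for each token, the probability of each possible next token.
# A real LLM computes this distribution with a neural network.
NEXT_TOKEN_PROBS = {
    "<start>": {"The": 0.6, "A": 0.4},
    "The": {"cat": 0.5, "model": 0.5},
    "A": {"cat": 0.7, "model": 0.3},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "model": {"predicts": 0.9, "<end>": 0.1},
    "sat": {"<end>": 1.0},
    "predicts": {"<end>": 1.0},
}

def generate(seed=0):
    """Generate text one token at a time by sampling the next-token distribution."""
    rng = random.Random(seed)
    token, output = "<start>", []
    while token != "<end>":
        probs = NEXT_TOKEN_PROBS[token]
        # Sample the next token in proportion to its probability.
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token != "<end>":
            output.append(token)
    return " ".join(output)

print(generate())
```

Nothing in the loop checks whether the sentence is true; the model only ever asks "what plausibly comes next?", which is exactly why fluent-but-wrong output is possible.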

The 8 concepts every developer must understand

🧠 LLM: Large Language Model. A neural network trained on massive text data to generate human-like language.

🪙 Token: The atomic unit of text for an LLM. A word, sub-word, or character. "ChatGPT" ≈ 3 tokens.

📝 Prompt: The input you give the model. Prompt engineering is the craft of writing better inputs for better outputs.

🌡️ Temperature: Controls randomness. Low temp = predictable. High temp = creative (but riskier). Range: 0.0 – 2.0.

👻 Hallucination: When an LLM confidently generates false information. It's a structural risk, not a bug that can be fully patched.

📚 RAG: Retrieval-Augmented Generation. Fetch real documents, feed them to the model. Dramatically reduces hallucination.

🔢 Embedding: A numerical vector representing the meaning of text. Powers semantic search and RAG pipelines.

🪟 Context window: The maximum tokens an LLM can "see" at once. Think of it as working memory. GPT-4 Turbo: 128k tokens.
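
The Embedding concept can be made concrete with toy numbers. The three-dimensional vectors below are invented (real embedding models output hundreds to thousands of dimensions), but the cosine-similarity math that powers semantic search is the same:

```python
import math

# Invented 3-D embeddings; real models output e.g. 384-3072 dimensions.
EMBEDDINGS = {
    "dog":   [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.75, 0.20],
    "car":   [0.10, 0.20, 0.90],
}

def cosine_similarity(a, b):
    """Near 1.0 = similar meaning/direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["puppy"]))  # high
print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["car"]))    # low
```

Semantic search is this comparison at scale: embed the query, then rank stored document embeddings by cosine similarity.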

How does an LLM actually work?

At the core of every modern LLM is the Transformer architecture, introduced by Google in the landmark 2017 paper "Attention Is All You Need." The key mechanism is self-attention — the model learns which words in a sentence are most relevant to each other, no matter how far apart they are.
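
The attention mechanism can be sketched without any framework. This pure-Python version uses the token vectors themselves as queries, keys, and values; a real Transformer first applies learned projection matrices, but the core "weight every position by relevance" step is the same:

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each position's output mixes all
    value vectors, weighted by how well its query matches every key."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Relevance of every position to this query, regardless of distance.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Three "token" vectors; real models derive Q, K, V via learned projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(x, x, x)
```

Because every position scores every other position directly, a word at the start of a long sentence can influence a word at the end just as easily as its neighbour can.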

Training happens in two major phases. First, pre-training: the model reads enormous amounts of text and learns to predict the next token, billions of times over, across thousands of GPUs. Second, fine-tuning (often using RLHF — Reinforcement Learning from Human Feedback): the model is steered to give helpful, harmless, and honest responses through human ratings.

Temperature and sampling — the knobs that matter

When an LLM generates a response, it doesn't just pick the single most likely next token. It samples from a probability distribution. temperature controls how "peaked" or "flat" that distribution is:

Temperature | Behaviour                                  | Best for
0.0         | Deterministic, always picks the top token  | Code generation, factual Q&A
0.3–0.7     | Balanced, slightly creative                | General chat, summarisation
1.0–1.5     | More varied, unpredictable                 | Brainstorming, creative writing
2.0         | Chaotic, often incoherent                  | Rarely useful
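
Temperature's effect on the distribution can be shown directly: logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The logits below are invented scores for three candidate tokens:

```python
import math

def sample_distribution(logits, temperature):
    """Divide logits by temperature, then softmax.
    Low T sharpens the distribution; high T flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens

for t in (0.2, 0.7, 1.5):
    probs = sample_distribution(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature exactly 0.0 the division is undefined, so implementations special-case it as greedy decoding: always take the highest-scoring token.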

Hallucination: the root cause, not just the symptom

Hallucination is the most misunderstood failure mode in LLMs. Developers often assume it's a bug that will eventually be fixed — but it's structural. Because LLMs generate text by predicting the next token based on patterns, they have no internal "fact-check" mechanism. They can produce confident, fluent, wrong answers.

The practical fix is RAG (Retrieval-Augmented Generation): you retrieve relevant, verified documents from your own database or the web, and inject them into the prompt. The model then generates its answer grounded in that context, not just its training data. This is the architecture behind most production AI assistants today.
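
The retrieve-then-generate flow can be sketched end to end. Everything here is invented for illustration: the documents, the naive keyword-overlap retriever (production systems use embeddings plus a vector database), and the prompt template; the assembled prompt would then be sent to whatever model API you use:

```python
import re

# Minimal RAG sketch. Documents and retriever are stand-ins; production
# systems use embeddings + a vector database for retrieval.
DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
    "Premium plans include priority support and a 99.9% uptime SLA.",
]

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Inject retrieved context so the answer is grounded, not guessed."""
    context = "\n".join(retrieve(question, documents))
    return ("Answer using only the context below. If the answer is not "
            "in the context, say you don't know.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

prompt = build_prompt("What is the refund policy?", DOCUMENTS)
```

The instruction to refuse when the context lacks the answer is what makes RAG outputs auditable: the model is steered toward the injected documents instead of its training-data guesses.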

Dev tip: Always validate LLM output for high-stakes domains (medical, legal, financial). Use RAG + citation to make outputs auditable. Never treat a confident response as a correct one.

Prompt engineering: the skill that actually matters right now

For intermediate developers, prompt engineering is the highest-leverage skill in the Gen AI stack. The same model can give wildly different outputs depending on how you phrase your input. Key techniques:

System prompts: set the model's persona and constraints before the conversation starts.
Few-shot examples: show the model 2–3 examples of the output format you want.
Chain-of-thought: ask the model to "think step by step" before answering; this dramatically improves complex reasoning.
Output constraints: tell the model to respond in JSON, a specific length, or a specific tone.
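
These techniques combine naturally in the role-based message list most chat APIs accept. The role/content shape below mirrors the common convention, and the reviewer persona and example content are invented; the exact request format varies by provider:

```python
import json

# System prompt: persona + constraints, set before the conversation.
messages = [
    {"role": "system",
     "content": "You are a senior Python reviewer. Respond only in JSON "
                "with keys 'severity' and 'comment'."},
    # Few-shot example: show the exact output format you want.
    {"role": "user", "content": "Review: eval(user_input)"},
    {"role": "assistant",
     "content": json.dumps({"severity": "high",
                            "comment": "eval on user input allows "
                                       "code injection."})},
    # Real request, with a chain-of-thought nudge and the output
    # constraint already enforced by the system prompt.
    {"role": "user",
     "content": "Think step by step, then review: open('f').read()"},
]
```

The system prompt pins the persona and the JSON constraint, the user/assistant pair demonstrates the format, and the final message applies chain-of-thought, so all four techniques act on a single request.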