Build a RAG system in 90 minutes

What we're building: a tiny chatbot that can answer questions about a set of documents you provide. You upload some text, it finds the most relevant bits, and uses a language model to write an answer grounded in what you uploaded. That's RAG, short for retrieval-augmented generation: retrieve, augment the prompt, generate. Three steps, one pipeline.

What you'll need

Everything below is free or nearly free to try.

Python 3.10+: Most modern installs work.
An OpenAI API key: ~$1 of credit is plenty.
A terminal: Anywhere you can run commands.
90 minutes: Really. We timed it.

What is RAG, really?

A plain language model (ChatGPT, Claude, whatever) only knows what it learned during training. It doesn't know about your company's docs, your product, your Slack history. To make it useful to you, you have to hand it the relevant context at runtime.

RAG is just that. When a question comes in, you search through your own documents, find the most relevant chunks, and paste them into the prompt along with the question. The model answers using your data.

The search part is the trick. We don't search by keywords — we search by meaning, using embeddings. Every document gets turned into a vector of numbers (an "embedding"); the question is too; and we find the documents whose vectors sit closest to the question's vector in high-dimensional space. Similar meaning, similar coordinates. That's the whole game.
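To make "closest in meaning" concrete, here is a minimal sketch of the comparison itself. The three-dimensional vectors and the file names are made up for illustration (real embeddings come from an embedding model and have hundreds or thousands of dimensions). The only real technique shown is cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by a real model)
docs = {
    "refund-policy.txt": [0.9, 0.1, 0.0],
    "office-hours.txt": [0.1, 0.9, 0.1],
    "api-limits.txt": [0.0, 0.2, 0.9],
}
question = [0.8, 0.2, 0.1]  # pretend this encodes "how do I get my money back?"

# Rank documents from most to least similar to the question
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(question, docs[name]),
    reverse=True,
)
print(ranked[0])
```

The question vector points in nearly the same direction as the first document's vector, so that document ranks first. The rest of the tutorial does the same thing with real embeddings and a faster search.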

Set up your environment

Make a new folder, open a terminal in it, and create a Python virtual environment. A virtual environment is an isolated Python setup with its own installed packages, so your project's dependencies don't mix with the rest of your machine.

bash

mkdir rag-demo && cd rag-demo
python3 -m venv .venv
source .venv/bin/activate     # on Windows: .venv\Scripts\activate

pip install openai faiss-cpu tiktoken numpy

Those four packages are all you need:

openai: the Python client for OpenAI's API.
faiss-cpu: Facebook's vector similarity search library. Fast, local, no setup.
tiktoken: counts how many tokens your text uses (so you can stay within limits).
numpy: array handling for the embedding vectors.

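Now put your API key in your shell so the code can read it (the OpenAI client picks up OPENAI_API_KEY from the environment). In Windows PowerShell, use $env:OPENAI_API_KEY = "sk-..." instead of export.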

bash

export OPENAI_API_KEY="sk-..."
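Before going further, it can save some debugging to confirm that Python actually sees the key. This is a small sketch: `key_is_set` is a hypothetical helper name, and it only checks that the variable exists and is not blank, not that the key is valid.

```python
import os

def key_is_set(env=os.environ, name="OPENAI_API_KEY"):
    """Return True if the variable exists and is not blank."""
    return bool(env.get(name, "").strip())

if __name__ == "__main__":
    if key_is_set():
        print("OPENAI_API_KEY is set")
    else:
        print("OPENAI_API_KEY is missing; export it in this shell first")
```

Run it in the same terminal where you exported the key. A new terminal window will not have it unless you export it again.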

Prepare your documents

For this tutorial, use anything: your FAQ, your company wiki, a blog, even Wikipedia articles. Put the text into a folder called docs/, one plain-text .txt file per document (the loader below only reads *.txt files, as UTF-8).

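Real documents are usually too long to embed well as a single vector: the embedding model has an input limit, and one vector for a whole document blurs its topics together. So we chunk them first. A good default: chunks of 400-600 words, with about 50 words of overlap between neighbouring chunks, so text near a boundary appears in both. Create a file called chunk.py: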

python

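from pathlib import Path

def chunk_text(text, size=500, overlap=50):
    """Split text into overlapping word chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    step = size - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + size]))
        # Stop once this window reaches the end, so we don't emit a
        # tail chunk that only repeats the overlap.
        if start + size >= len(words):
            break
    return chunks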

def load_and_chunk(folder="docs"):
    all_chunks = []
    for path in sorted(Path(folder).glob("*.txt")):  # sorted for a stable chunk order
        text = path.read_text(encoding="utf-8")
        for idx, chunk in enumerate(chunk_text(text)):
            all_chunks.append({
                "source": path.name,
                "chunk_id": idx,
                "text": chunk,
            })
    return all_chunks

if __name__ == "__main__":
    chunks = load_and_chunk()
    print(f"Loaded {len(chunks)} chunks from docs/")

Run it:

bash

python chunk.py

Tip

Chunk size matters more than people admit. Too small and you lose context. Too large and retrieval gets fuzzy. 500 words is a safe starting point for prose; for code or tables you want smaller, more structured chunks.
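To see how size and overlap interact, it helps to look at the word offsets each chunk would cover. The sketch below assumes the word-window scheme described above; `chunk_spans` is a hypothetical helper that computes only the boundaries, not the text, and it stops once a window reaches the end of the document.

```python
def chunk_spans(n_words, size=500, overlap=50):
    """Return (start, end) word offsets for each sliding-window chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")
    spans = []
    step = size - overlap  # each window starts this many words after the last
    for start in range(0, n_words, step):
        end = min(start + size, n_words)
        spans.append((start, end))
        if end == n_words:
            break
    return spans

# A 1000-word document with the default settings
print(chunk_spans(1000))  # [(0, 500), (450, 950), (900, 1000)]

# Smaller chunks: more of them, each covering less context
print(len(chunk_spans(1000, size=200, overlap=20)))
```

Each span starts 50 words before the previous one ends, which is the overlap. Shrinking the size gives you more, narrower chunks for the same document, which is the trade-off the tip above describes.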

Embed the chunks

Now we turn every chunk into a vector. Create embed.py:

python

import json
import numpy as np
from openai import OpenAI
from chunk import load_and_chunk

client = OpenAI()
MODEL = "text-embedding-3-small"   # cheap, fast, good enough

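def embed_batch(texts, batch_size=100):
    """Embed texts in modest batches; one huge request can exceed API limits."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        response = client.embeddings.create(
            model=MODEL,
            input=texts[start : start + batch_size],
        )
        # Sort by index so vectors line up with the input order
        for d in sorted(response.data, key=lambda d: d.index):
            vectors.append(np.array(d.embedding, dtype="float32"))
    return vectors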

def main():
    chunks = load_and_chunk()
    if not chunks:
        raise SystemExit("No chunks found. Put some .txt files in docs/ first.")
    print(f"Embedding {len(chunks)} chunks...")
    vectors = embed_batch([c["text"] for c in chunks])

    # Save vectors + metadata side by side: row i of vectors.npy
    # corresponds to entry i of meta.json
    np.save("vectors.npy", np.stack(vectors))
    with open("meta.json", "w") as f:
        json.dump(chunks, f)
    print("Saved vectors.npy + meta.json")

if __name__ == "__main__":
    main()

Run it:

bash

python embed.py

You just spent about a penny or less, depending on how much text you embedded, and turned your documents into an array of floating-point numbers, one row per chunk. Congratulations.
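If you want a rough idea of the cost before embedding a larger corpus, the arithmetic is simple. This is a back-of-the-envelope sketch, not a billing calculator: `estimate_embedding_cost` is a hypothetical helper, the 1.3 tokens-per-word ratio is a common rule of thumb for English text, and the price passed in below is a placeholder you should replace with the current figure from the provider's price list.

```python
def estimate_embedding_cost(word_counts, usd_per_million_tokens, tokens_per_word=1.3):
    """Rough token and cost estimate for embedding chunks of the given word counts."""
    total_tokens = sum(word_counts) * tokens_per_word
    cost = total_tokens / 1_000_000 * usd_per_million_tokens
    return total_tokens, cost

# 200 chunks of 500 words each; the price is a placeholder assumption
tokens, usd = estimate_embedding_cost([500] * 200, usd_per_million_tokens=0.02)
print(f"~{tokens:,.0f} tokens, ~${usd:.4f}")
```

For an exact token count, tiktoken can tokenize the actual text instead of estimating from word counts.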

Build the vector index

FAISS is a library for fast similarity search. For a few thousand chunks it's overkill; for a few million it's essential. Either way, it's easy to use. Create index.py:

python

import json
import numpy as np
import faiss

def build_index():
    vectors = np.load("vectors.npy")
    dim = vectors.shape[1]
    index = faiss.IndexFlatIP(dim)          # inner-product search
    faiss.normalize_L2(vectors)             # so IP == cosine similarity
    index.add(vectors)
    faiss.write_index(index, "rag.index")
    print(f"Indexed {index.ntotal} vectors of dim {dim}")

def load_index():
    index = faiss.read_index("rag.index")
    with open("meta.json") as f:
        meta = json.load(f)
    return index, meta

if __name__ == "__main__":
    build_index()

Run it once to build the index:

bash

python index.py
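For intuition about what a flat inner-product index computes, here is a pure-Python sketch of the same idea: scale every vector to unit length, then rank by dot product, which for unit vectors equals cosine similarity. This is an illustration, not FAISS's implementation. `normalize` and `top_k` are hypothetical helper names and the vectors are toy values.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length (what L2 normalization does)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, vectors, k=2):
    """Exhaustive search: score every stored vector, keep the k best."""
    q = normalize(query)
    scored = []
    for i, v in enumerate(vectors):
        score = sum(a * b for a, b in zip(q, normalize(v)))  # dot product
        scored.append((score, i))
    return heapq.nlargest(k, scored)  # highest scores first

vectors = [[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # toy data
for score, i in top_k([2.0, 0.5], vectors):
    print(f"vector {i}: {score:.3f}")
```

Note that vector length does not matter after normalization: [3.0, 0.0] wins because of its direction, not its size. Scoring every vector is fine for small collections. The value of a dedicated library shows up when the collection gets large.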

Query with retrieval

Now the fun part. A user types a question. We embed the question, find the closest chunks, and return them. Create search.py:

python

import numpy as np
import faiss
from openai import OpenAI
from index import load_index

client = OpenAI()

def search(query, k=4):
    index, meta = load_index()

    # Embed the query with the same model used for chunks
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=[query],
    ).data[0].embedding
    q = np.array([emb], dtype="float32")
    faiss.normalize_L2(q)

    scores, idxs = index.search(q, k)
    results = []
    for score, i in zip(scores[0], idxs[0]):
        if i < 0:
            # FAISS pads with -1 when the index holds fewer than k vectors
            continue
        chunk = meta[int(i)]
        results.append({
            "score": float(score),
            "source": chunk["source"],
            "text": chunk["text"],
        })
    return results

if __name__ == "__main__":
    import sys
    query = " ".join(sys.argv[1:]) or "what is this about?"
    for r in search(query):
        print(f"\n[{r['score']:.3f}] {r['source']}")
        print(r["text"][:200] + "...")

Try it:

bash

python search.py "how does the thing work?"

You should see the 4 closest chunks from your documents, each with its cosine similarity score (higher means closer in meaning). That's retrieval.
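The search returns up to k results even when nothing in the index is a good match. One common refinement, sketched here, is to drop results below a minimum score before using them. The dictionaries mirror the shape returned by search() above. `filter_results` and the 0.3 threshold are assumptions for illustration, and a sensible cutoff depends on your embedding model and your data.

```python
def filter_results(results, min_score=0.3):
    """Keep only results whose similarity score clears the threshold."""
    return [r for r in results if r["score"] >= min_score]

# Hand-written sample results (not real search output)
sample = [
    {"score": 0.62, "source": "guide.txt", "text": "..."},
    {"score": 0.41, "source": "faq.txt", "text": "..."},
    {"score": 0.08, "source": "changelog.txt", "text": "..."},
]

for r in filter_results(sample):
    print(r["source"], r["score"])
```

To pick a threshold, run a handful of questions you know the documents can and cannot answer, and look at where the scores separate.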

Generate the answer

The final piece: take the retrieved chunks, shove them into a prompt, and ask a language model to answer. Create ask.py:

python

import sys
from openai import OpenAI
from search import search

client = OpenAI()

SYSTEM = """You are a careful assistant. Answer the user's question using ONLY
the context below. If the answer isn't in the context, say so honestly.
Cite the source filename in brackets after each claim, e.g. [guide.txt]."""

def ask(question):
    chunks = search(question, k=4)
    context = "\n\n---\n\n".join(
        f"[{c['source']}]\n{c['text']}" for c in chunks
    )
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.2,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    q = " ".join(sys.argv[1:]) or "what is this about?"
    print(ask(q))

Try it:

bash

python ask.py "explain the main idea in one paragraph"

You just built a RAG system. The model is using your documents to answer. The system prompt tells it to cite its sources and to say so when the context doesn't contain the answer. Models usually comply, but not always, so spot-check the citations.

Tip

The temperature=0.2 setting makes sampling less random, so answers stick closer to the most likely wording and vary less between runs. For a factual RAG chatbot, low temperature is your friend. For a creative assistant, crank it up.
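To see why low temperature reads as more conservative, here is a small sketch of temperature-scaled softmax, the standard way sampling temperature is described: scores are divided by the temperature before being turned into probabilities. The three scores are made-up numbers, not output from any model, and this is not a view into how the API is implemented.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 almost all of the probability lands on the top-scoring candidate. At 1.5 the alternatives get a real share, which is where the extra variety comes from.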

You did the thing

You now have a working RAG pipeline: chunks → embeddings → index → retrieve → generate. Every production RAG system starts here. The rest is evaluation, scaling, and guardrails.

What to do next

This is the "it works on my machine" version. For production you'll want to:

Evaluate — write test questions with expected answers and measure pass rate. Don't ship changes that regress.
Observe — log every question, retrieved chunks, and the final answer. Read our observability starter kit for the how.
Guardrail — add input validation, rate limits, and output filters.
Scale — swap FAISS for a managed vector database (Pinecone, Qdrant, pgvector) when you exceed a million chunks.
Govern — document what data is in the index, who can query it, and your retention policy. See the NIST AI RMF shipping checklist.
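The first item on that list can start very small. Here is a minimal sketch of an evaluation loop: a list of test questions, the keywords a correct answer should contain, and a pass rate. Everything in it is illustrative: `pass_rate` is a hypothetical helper, the test cases are invented, and `fake_answer` is a stand-in that returns canned text so the sketch runs without an API key. Keyword matching is a crude check, but it is enough to catch obvious regressions.

```python
def pass_rate(cases, answer_fn):
    """Fraction of cases whose answer contains every expected keyword."""
    passed = 0
    for case in cases:
        answer = answer_fn(case["question"]).lower()
        if all(kw.lower() in answer for kw in case["expect"]):
            passed += 1
    return passed / len(cases)

# Hypothetical test cases; write your own from your documents
cases = [
    {"question": "How long is the refund window?", "expect": ["30 days"]},
    {"question": "Which plan includes SSO?", "expect": ["enterprise"]},
]

def fake_answer(question):
    """Stand-in for ask(); returns canned text so the sketch runs offline."""
    return "Refunds are accepted within 30 days [policy.txt]."

print(f"pass rate: {pass_rate(cases, fake_answer):.0%}")
```

To run it against the real pipeline, pass the ask function from ask.py instead of fake_answer, then rerun after every change to chunking, retrieval, or the prompt.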

If you want help shipping a production version — including the evals, the observability, and the governance — write to us. We do senior-led engagements only. The quote you'll get is the quote you'll pay.
