Building RAG: A Hello World Example

Table of Contents

1 Overview
2 What is RAG?
3 The general flow of a RAG application
4 Prerequisites
5 The Complete Code
6 Breaking Down the Code
7 Running the Example
8 Why This Matters
9 Limitations and Next Steps
10 Conclusion

Overview

Retrieval-Augmented Generation (RAG) has become one of the most powerful techniques in modern AI applications. Instead of relying solely on an LLM’s training data, RAG allows models to access external knowledge by retrieving relevant information before generating responses.

In this tutorial, we’ll build a minimal RAG system from scratch in about 60 lines of Python. By the end, you’ll understand how RAG works under the hood and have a working example you can modify for your own projects.

What is RAG?

RAG combines two key components:

Retrieval: Finding relevant information from a knowledge base
Generation: Using an LLM to generate answers based on that retrieved information

Think of it as giving your AI a reference library. Instead of answering from memory alone, it can look up relevant passages first, leading to more accurate and grounded responses.

The general flow of a RAG application

Prerequisites

Before starting, you’ll need:

Python 3.7+
Ollama installed and running locally
Two Ollama models downloaded:
- nomic-embed-text (for embeddings)
- gemma3:1b (for chat)

Install the models with:

ollama pull nomic-embed-text
ollama pull gemma3:1b

You’ll also need these Python packages:

pip install requests numpy

The Complete Code

Here’s our complete RAG implementation:

import requests 
import numpy as np

OLLAMA_URL = "http://localhost:11434/api"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "gemma3:1b"

full_text = """
The Apollo 11 mission launched on July 16, 1969. 
Neil Armstrong was the first person to walk on the moon. 
Buzz Aldrin joined him 19 minutes later. 
Michael Collins stayed in orbit in the command module. 
Michael Jackson also stayed at home, in his ship too.
The total duration of the lunar surface EVA was 2 hours and 31 minutes.
"""

chunks = [chunk.strip() for chunk in full_text.split('.') if chunk.strip()]

def get_embedding(text):
    response = requests.post(
        f"{OLLAMA_URL}/embeddings",
        json={"model": EMBED_MODEL, "prompt": text}
    )
    eb = response.json()["embedding"]
    print(f"Embedding done")
    return eb

print("Building vector...")
vector_store = []
for chunk in chunks:
    vector = get_embedding(chunk)
    vector_store.append({
        "text": chunk,
        "vector": np.array(vector)
    })

def retrieve(question):
    question_vector = np.array(get_embedding(question))
    best_chunk = None
    best_score = -1
    
    for item in vector_store:
        score = np.dot(item["vector"], question_vector) / (
            np.linalg.norm(item["vector"]) * np.linalg.norm(question_vector)
        )
        if score > best_score:
            best_score = score
            best_chunk = item["text"]
    
    return best_chunk

user_query = "Who stayed in the ship that was docked in the orbit?"
print(f"\n User: {user_query}")

context = retrieve(user_query)
print(f"\n Retrieved context: {context}")

prompt = f"""
Use this context to answer the question.
Context: {context}
Question: {user_query}
"""

response = requests.post(
    f"{OLLAMA_URL}/chat",
    json={
        "model": CHAT_MODEL,
        "stream": False,
        "messages": [
            {"role": "user", "content": prompt}
        ]
    }
)

print("Answer:", response.json()["message"]["content"])

Breaking Down the Code

Step 1: Document Chunking

chunks = [chunk.strip() for chunk in full_text.split('.') if chunk.strip()]

We split our knowledge base into smaller pieces. In this simple example, we split by periods, but production systems typically use more sophisticated chunking strategies based on semantic boundaries, token limits, or sliding windows.

Step 2: Creating Embeddings

def get_embedding(text):
    response = requests.post(
        f"{OLLAMA_URL}/embeddings",
        json={"model": EMBED_MODEL, "prompt": text}
    )
    return response.json()["embedding"]

Embeddings are numerical representations of text that capture semantic meaning. Similar texts have similar embeddings. The nomic-embed-text model converts each chunk into a high-dimensional vector.

Step 3: Building the Vector Store

vector_store = []
for chunk in chunks:
    vector = get_embedding(chunk)
    vector_store.append({
        "text": chunk,
        "vector": np.array(vector)
    })

We create our knowledge base by storing each chunk alongside its embedding vector. In production, you’d use specialized vector databases like Pinecone, Weaviate, or ChromaDB, but a simple list works for learning.

Step 4: Semantic Retrieval

def retrieve(question):
    question_vector = np.array(get_embedding(question))
    best_chunk = None
    best_score = -1
    
    for item in vector_store:
        score = np.dot(item["vector"], question_vector) / (
            np.linalg.norm(item["vector"]) * np.linalg.norm(question_vector)
        )
        if score > best_score:
            best_score = score
            best_chunk = item["text"]
    
    return best_chunk

This is where the magic happens. We:

Convert the user’s question into an embedding
Calculate cosine similarity between the question and each chunk
Return the most similar chunk

Cosine similarity measures the angle between vectors, giving us a score between -1 and 1. Higher scores mean more similar content.

Step 5: Augmented Generation

prompt = f"""
Use this context to answer the question.
Context: {context}
Question: {user_query}
"""

response = requests.post(
    f"{OLLAMA_URL}/chat",
    json={
        "model": CHAT_MODEL,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}]
    }
)

Finally, we provide both the retrieved context and the question to the LLM. This grounds the model’s response in actual retrieved information rather than relying on potentially outdated or incorrect training data.

Running the Example

When you run this code, you’ll see output like:

Building vector...
Embedding done
Embedding done
...

User: Who stayed in the ship that was docked in the orbit?

Retrieved context: Michael Collins stayed in orbit in the command module

Answer: Michael Collins stayed in the ship docked in orbit.

Notice how the system correctly identifies Michael Collins, even though the text also mentions Michael Jackson staying “in his ship” (a humorous distractor). The semantic search finds the most relevant context.

Why This Matters

Without RAG, the LLM would have to answer from its training data alone. With RAG:

Accuracy: Answers are grounded in your specific documents
Currency: You can update knowledge without retraining
Transparency: You can see which sources informed the answer
Reduced hallucination: The model works from real retrieved text

Limitations and Next Steps

This hello-world example simplifies many aspects of production RAG systems:

What’s Missing:

Error handling
Chunk size optimization
Top-k retrieval (returning multiple chunks)
Reranking retrieved results
Persistent vector storage
Metadata filtering
Query expansion and rewriting

Production Improvements:

Use vector databases (Pinecone, Weaviate, ChromaDB)
Implement hybrid search (combining semantic and keyword search)
Add citation tracking
Use more sophisticated chunking strategies
Implement caching for embeddings
Add evaluation metrics

Conclusion

RAG doesn’t have to be complicated. This 60-line example demonstrates the core concept: convert text to embeddings, find similar content, and use it to augment generation. From here, you can expand to handle larger documents, more complex queries, and production-scale deployments.

The beauty of RAG is that it makes AI systems more reliable and maintainable. Instead of hoping your model memorized the right information during training, you explicitly provide it with relevant context for each query.

Đạt Trần

I build softwares that solve problems. I also love writing/documenting things I learn/want to learn.