Production · KUB

RAG Knowledge System

Sole architect & developer · 2025–2026

⚡ 5 min → 30 sec document lookup

Problem

KUB's 17-person business analytics team relied on 20+ internal operational documents to answer day-to-day questions. When the document owner was unavailable, tracking down an answer took anywhere from five minutes to over an hour. External AI tools were off the table entirely — KUB's data governance policy prohibits sending internal operational data outside the network.

System Design

The core constraint — no external API calls — shaped every architectural decision. The entire system runs locally using Ollama as the LLM host. Nothing leaves KUB's network.

Retrieval runs two approaches in parallel: BM25 for exact keyword matching and dense vector search for semantic similarity. Results from both paths are pooled into up to 40 candidates, deduplicated by content hash, then passed to a cross-encoder reranker that scores each candidate against the original query before narrowing to the top five. The cross-encoder is slower than cosine similarity alone but substantially more accurate for operational text where precise terminology matters.
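The pool, deduplicate, and rerank steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production code: the function names are hypothetical, the two result lists are interleaved so neither path dominates the 40-candidate cap, and `score_fn` stands in for the cross-encoder (here a simple word-overlap scorer so the example runs without a model).

```python
import hashlib
from itertools import zip_longest
from typing import Callable, Iterable, List


def content_hash(text: str) -> str:
    """Hash whitespace-normalised, lower-cased text so near-identical copies collide."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def pool_candidates(bm25_hits: Iterable[str], dense_hits: Iterable[str],
                    limit: int = 40) -> List[str]:
    """Interleave both result lists, drop duplicates by content hash, cap at `limit`."""
    seen = set()
    pooled: List[str] = []
    for pair in zip_longest(bm25_hits, dense_hits):
        for chunk in pair:
            if chunk is None:  # one list ran out before the other
                continue
            key = content_hash(chunk)
            if key in seen:
                continue
            seen.add(key)
            pooled.append(chunk)
            if len(pooled) == limit:
                return pooled
    return pooled


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float], top_k: int = 5) -> List[str]:
    """Score every candidate against the original query and keep the best `top_k`."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (assumption): counts shared words. A real cross-encoder goes here."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))
```

In a real deployment `score_fn` would wrap whatever cross-encoder model is loaded locally; the rest of the flow stays the same.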

Every response streams to the user in real time. All interactions are logged through LangSmith, building a dataset for future retrieval improvements.
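For illustration, streaming from a local Ollama server can be done with the standard library alone. The sketch below assumes Ollama's default address (`http://localhost:11434`) and its `/api/generate` endpoint, which returns newline-delimited JSON when `stream` is true. The model name `llama3` is a placeholder, since the model in use is not stated here, and LangSmith logging is left out.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_tokens(ndjson_lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream until `done` is true."""
    for raw in ndjson_lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        if event.get("response"):
            yield event["response"]
        if event.get("done"):
            break


def stream_answer(prompt: str, model: str = "llama3",
                  host: str = "http://localhost:11434") -> Iterator[str]:
    """Send a prompt to a local Ollama server and yield the answer as it is generated.

    `model` and `host` are assumptions for this sketch, not values from the project.
    """
    payload = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The response object iterates line by line, one JSON event per line.
        yield from iter_tokens(response)
```

A UI layer can then print or render each fragment as it arrives, for example `for piece in stream_answer("..."): print(piece, end="", flush=True)`.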

What I Built

01 The whole system runs locally using Ollama as the LLM host. Nothing leaves KUB's network, which was the non-negotiable constraint that ruled out external APIs from day one.
02 Before searching, the system rewrites each question into three differently-worded variants. This matters because the same concept gets described five different ways across different documents, and a single query often misses relevant chunks.
03 Retrieval runs two approaches in parallel: BM25 for exact keyword matching and ChromaDB for semantic similarity. Results are merged and deduplicated so the LLM isn't reading the same passage twice.
04 After retrieval, a cross-encoder reranks every candidate against the original question before narrowing to the top five. It's slower than cosine similarity, but much more accurate for operational text where word choice matters.
05 Responses stream in real time and every interaction gets logged through LangSmith, building a dataset to improve retrieval quality over time.
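The query-rewriting step (item 02) might look roughly like the sketch below. The prompt wording, the `expand_query` name, and the `generate` callable (any function that sends a prompt to the local LLM and returns its text) are all assumptions made for illustration.

```python
import re
from typing import Callable, List

# Hypothetical prompt; the wording actually used in the project is not known.
REWRITE_PROMPT = (
    "Rewrite the question below in {n} different ways, one per line, "
    "keeping the meaning unchanged.\n\nQuestion: {question}"
)

# Matches a leading bullet ("- ", "* ") or list number ("1. ", "2) ").
_LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def expand_query(question: str, generate: Callable[[str], str], n: int = 3) -> List[str]:
    """Ask the LLM for rewrites and return up to `n` distinct, non-empty variants."""
    raw = generate(REWRITE_PROMPT.format(n=n, question=question))
    seen = {question.strip().lower()}  # a rewrite identical to the original adds nothing
    variants: List[str] = []
    for line in raw.splitlines():
        candidate = _LIST_MARKER.sub("", line).strip()
        key = candidate.lower()
        if not candidate or key in seen:
            continue
        seen.add(key)
        variants.append(candidate)
        if len(variants) == n:
            break
    return variants
```

Each variant is then sent through both retrieval paths, and the combined hits are deduplicated before reranking against the original question.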

Engineering Challenge

The first version of retrieval relied primarily on semantic search, and it failed on domain-specific terminology. KUB's operational documents use precise technical vocabulary that embedding-based search alone matched poorly. Adding BM25 as a parallel path and tuning the fusion weights between the two approaches was the fix that made retrieval reliable.
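How the two score lists were combined is not spelled out here. One common approach, shown as an assumption rather than the method actually used, is a weighted sum of min-max normalised scores, with `bm25_weight` as the tunable knob. Function names and document IDs are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, every document gets 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (score - lo) / (hi - lo) for doc_id, score in scores.items()}


def fuse(bm25: Dict[str, float], dense: Dict[str, float],
         bm25_weight: float = 0.5) -> List[Tuple[str, float]]:
    """Combine BM25 and dense scores; a document missing from one path scores 0 there."""
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    fused = {
        doc_id: bm25_weight * bm25_n.get(doc_id, 0.0)
        + (1.0 - bm25_weight) * dense_n.get(doc_id, 0.0)
        for doc_id in bm25_n.keys() | dense_n.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Raising `bm25_weight` favours exact terminology matches, while lowering it favours semantic similarity; normalising first keeps unbounded BM25 scores from swamping cosine similarities.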

Evaluation

Answer quality was measured using the RAGAS framework, establishing an objective baseline for tracking improvement across iterations rather than relying on qualitative judgment.

Results

<10s per query (was 5+ min)

17 analysts using it daily

0 external API calls

Document lookups that previously took five or more minutes now complete in under 10 seconds. Seventeen analysts use the system daily across 20+ operational documents, all running without a single external API call.

Stack

LangChain · ChromaDB · Ollama · BM25 · Cross-encoder · RAGAS · LangSmith · Docker · Streamlit · Python
