RAG Knowledge System
Sole architect & developer · 2025–2026
Problem
KUB's 17-person analytics team depended on 20+ internal documents to answer day-to-day questions. When the person who owned a document wasn't around, finding an answer could take anywhere from five minutes to over an hour. ChatGPT or any other external API was off the table: KUB's data governance policy doesn't allow internal operational data to leave the network.
What I Built
- 01 The whole system runs locally using Ollama as the LLM host. Nothing leaves KUB's network, which was the non-negotiable constraint that ruled out external APIs from day one.
- 02 Before searching, the system rewrites each question into three differently worded variants. This matters because the same concept gets described five different ways across different documents, and a single query often misses relevant chunks.
- 03 Retrieval runs two approaches in parallel: BM25 for exact keyword matching and ChromaDB for semantic similarity. Results are merged and deduplicated so the LLM isn't reading the same passage twice.
- 04 After retrieval, a cross-encoder reranks every candidate against the original question before narrowing to the top five. It's slower than cosine similarity, but much more accurate for operational text where word choice matters.
- 05 Responses stream in real time and every interaction gets logged through LangSmith, building a dataset to improve retrieval quality over time.
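The expansion and merge steps (02 and 03) can be sketched roughly as follows. This is a minimal illustration, not the production code: `expand_query`, `build_candidate_pool`, `content_key`, the injected `rephrase` callable, and the retriever signature are all hypothetical names. The default of 10 passages per retriever per variant is an assumption chosen so that 3 variants × 2 retrievers gives a pool of up to 60, and hashing whitespace-normalised, lower-cased text is just one plausible way to deduplicate by content.

```python
import hashlib
from typing import Callable, Iterable

# Hypothetical interface: a retriever maps (query, k) to a list of passages.
Retriever = Callable[[str, int], list[str]]


def expand_query(question: str, rephrase: Callable[[str], str], n_variants: int = 3) -> list[str]:
    """Ask an LLM (through the injected `rephrase` callable) for reworded variants."""
    prompt = (
        f"Rewrite the question below in {n_variants} different ways, "
        f"one per line, without numbering.\n\nQuestion: {question}"
    )
    lines = [line.strip() for line in rephrase(prompt).splitlines()]
    variants = [line for line in lines if line][:n_variants]
    # Fall back to the original question if the model returns nothing usable.
    return variants or [question]


def content_key(text: str) -> str:
    """Hash normalised text so trivially different copies of a passage collide."""
    normalised = " ".join(text.split()).lower()  # assumption: ignore case and spacing
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def build_candidate_pool(
    variants: Iterable[str],
    retrievers: Iterable[Retriever],
    per_query_k: int = 10,  # assumption: 3 variants x 2 retrievers x 10 = 60
) -> list[str]:
    """Run every retriever on every variant and keep the first copy of each passage."""
    retriever_list = list(retrievers)
    seen: set[str] = set()
    pool: list[str] = []
    for variant in variants:
        for retrieve in retriever_list:
            for passage in retrieve(variant, per_query_k):
                key = content_key(passage)
                if key not in seen:
                    seen.add(key)
                    pool.append(passage)
    return pool
```

In a real setup the two retrievers would wrap a BM25 index and a ChromaDB collection, and `rephrase` would call the locally hosted model. Injecting them as callables keeps the merge logic testable without either dependency.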
User query
→ Multi-query expansion (3 LLM-rephrased variants)
→ Hybrid retrieval: BM25 (keyword) + ChromaDB (dense vector)
→ Candidate pool: up to 60 pairs, content-hash deduped
→ Cross-encoder reranking → top-5 results
→ HAL (internal LLM via Ollama) generates answer
→ Streamed to user · feedback logged via LangSmith
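The reranking stage of this pipeline might look something like the sketch below. It scores each (question, passage) pair jointly and keeps the five best. The names `rerank`, `PairScorer`, and `overlap_scorer` are hypothetical, and `overlap_scorer` is only a toy stand-in so the example runs without a model. The source does not say which cross-encoder model or library was used.

```python
from typing import Callable, Sequence

# Hypothetical interface: scores a batch of (question, passage) pairs,
# returning one relevance score per pair (higher means more relevant).
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(
    question: str,
    candidates: Sequence[str],
    score_pairs: PairScorer,
    top_k: int = 5,
) -> list[str]:
    """Score every candidate against the original question and keep the best top_k."""
    if not candidates:
        return []
    pairs = [(question, passage) for passage in candidates]
    scores = score_pairs(pairs)
    # sorted() is stable, so tied passages keep their retrieval order.
    ranked = sorted(zip(scores, candidates), key=lambda item: item[0], reverse=True)
    return [passage for _, passage in ranked[:top_k]]


def overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy stand-in for a cross-encoder: counts words shared by question and passage."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]
```

If the sentence-transformers library were used, the `predict` method of a `CrossEncoder` instance accepts a list of text pairs and returns one score per pair, so it could be passed in as `score_pairs`. Reranking against the original question rather than the rephrased variants keeps the final ordering tied to what the user actually asked.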
Results
Document lookups that used to take five minutes or more now finish in about 30 seconds. The old process of tracking down whoever owned a document just doesn't happen anymore. Seventeen analysts use it daily across 20+ operational documents, and the whole thing runs without a single external API call.
Stack