Overview
A production-grade, multi-modal RAG knowledge base platform featuring real-time streaming chat, multi-turn conversation-history compaction, session-scoped document retrieval, and an admin console — fully containerized and running entirely on local hardware via Docker Compose (no external API calls).
Implementation
- Layered backend: FastAPI with a clean Router → Service → Repository → Schema architecture, async SQLAlchemy 2.x (AsyncSession + aiosqlite), Alembic migrations, and JWT (HS256) + bcrypt auth with role-based access (admin / user)
- Local LLM stack via Ollama: gpt-oss for reasoning/chat, llava:7b as the vision model for image and multi-modal document parsing, and bge-m3 for 1024-dim GPU embeddings
- RAG engine: RAGAnything built on LightRAG, with custom adapters (LLM / vision / embedding) and a ChromaVectorDBStorage adapter implementing LightRAG's BaseVectorStorage interface against ChromaDB
- Multi-modal ingestion: MinerU for PDF layout/OCR parsing, LibreOffice for DOCX/PPTX/XLSX, and llava vision captioning for images; everything is chunked, embedded, and indexed into ChromaDB
- Conversation compaction: once a session passes a message threshold, older turns are automatically summarized by the LLM while recent turns are kept verbatim, so the context window stays bounded without losing conversational continuity
- Session-scoped retrieval: retrieval is automatically confined to documents attached to the current session, preventing cross-contamination with the global knowledge base
- Streaming frontend: Next.js 16 App Router + TypeScript + shadcn/ui + Zustand, consuming an SSE stream (useSSEStream) for live token rendering, with a WebGL galaxy background (OGL)
- Three query modes: Hybrid (semantic + knowledge graph), Local (focused chunks), and Global (graph-wide synthesis)
- Infra: Docker Compose orchestrating ChromaDB, Ollama, backend, and a multi-stage-built frontend, with health checks and dependency gating
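
The conversation compaction item can be illustrated with a minimal sketch. It is not the project's actual code: the `Message` type, the `compact_history` function, the `threshold` and `keep_recent` defaults, and the injected `summarize` callable (standing in for the LLM call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str      # e.g. "user", "assistant", or "system"
    content: str


def compact_history(
    messages: List[Message],
    summarize: Callable[[List[Message]], str],  # hypothetical stand-in for the LLM summarizer
    threshold: int = 20,                        # assumed value, not taken from the project
    keep_recent: int = 6,                       # assumed value, not taken from the project
) -> List[Message]:
    """Return the history unchanged, or one summary message plus the recent turns."""
    if not 0 < keep_recent <= threshold:
        raise ValueError("keep_recent must be in the range 1..threshold")
    if len(messages) <= threshold:
        return list(messages)  # below the threshold: nothing to compact

    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary = Message(
        role="system",
        content="Summary of earlier conversation: " + summarize(older),
    )
    return [summary, *recent]
```

Injecting the summarizer as a callable keeps the thresholding logic testable without a running model; in a real deployment it would presumably wrap a call to the chat model served by Ollama.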
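
Session-scoped retrieval amounts to filtering candidates by document ownership before ranking them. The sketch below shows that idea with an in-memory list and cosine similarity. The `Chunk` type, `retrieve_for_session`, and the `doc_id` field are hypothetical names; in the actual stack the restriction would more plausibly be expressed as a metadata filter on the vector store query.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Chunk:
    doc_id: str                    # hypothetical: ID of the document the chunk came from
    text: str
    embedding: Tuple[float, ...]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_for_session(
    query_embedding: Sequence[float],
    chunks: List[Chunk],
    session_doc_ids: Set[str],
    top_k: int = 3,
) -> List[Chunk]:
    """Rank only chunks whose document is attached to the current session."""
    candidates = [c for c in chunks if c.doc_id in session_doc_ids]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:top_k]
```

Filtering before ranking, rather than after, means a highly similar chunk from an unrelated document can never displace a session document from the top-k results.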
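
On the backend side, an SSE stream is a sequence of text frames, each terminated by a blank line. The following sketch shows one way tokens could be framed for the frontend to consume. The `sse_event` and `token_stream` helpers, the `{"token": ...}` payload shape, and the `done` event name are assumptions, not the project's wire format.

```python
import json
from typing import Iterable, Iterator, Optional


def sse_event(data: dict, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events frame: optional event name, one data line, blank line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on a single data line.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def token_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one frame per token, then a final frame signalling completion."""
    for token in tokens:
        yield sse_event({"token": token})
    yield sse_event({}, event="done")  # hypothetical end-of-stream marker
```

In a FastAPI route, a generator like this would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.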