Open for Enterprise. We build custom, production-grade RAG systems for organizations — same 9-layer pipeline, tuned to your data, your permissions, your stack. Deployed in days, not months. At a fraction of what closed-source vendors charge. → See what we can build for you
| 🧬 Contextual Retrieval LLM prepends situating context to every chunk before indexing. Each vector carries the full document story, not just a fragment. | 🔀 RAG-Fusion + RRF Generates N query variants, retrieves independently for each, then merges all ranked lists via Reciprocal Rank Fusion for better recall. | 🕸️ GraphRAG Builds a NetworkX knowledge graph over document entities. Retrieves relational context that a pure vector search would miss entirely. |
| ✅ Corrective RAG (CRAG) LLM grades every retrieved chunk for relevance. Noise is silently dropped before the answer is generated. The model only sees what matters. | ⚡ Neural Reranking A Cross-Encoder (ms-marco-MiniLM) reorders all retrieval candidates by true query–passage relevance score, not just embedding similarity. | 🔭 HyDE Generates a hypothetical answer first to expand sparse queries into a richer dense embedding space before the actual retrieval step. |
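
As a rough illustration of the RAG-Fusion merge step described above, here is a minimal Reciprocal Rank Fusion sketch. The function name `reciprocal_rank_fusion`, the smoothing constant `k=60` (a commonly used default), and the sample document IDs are assumptions for illustration and are not taken from this project's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by many query variants rise to the top.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for three query variants.
variant_results = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
fused = reciprocal_rank_fusion(variant_results)
print(fused)
```

Because scores depend only on rank positions, the lists can come from retrievers with incomparable raw scores (for example BM25 and a vector index).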
| 🧠 Live Reasoning Panel Streams the model's <think> chain-of-thought in real time. Watch it reason through your documents before the answer appears. |
| 💾 Semantic Cache Cosine-similarity cache at threshold 0.92 on query embeddings. Repeat questions skip retrieval and generation entirely, so the answer is instant. | 💬 Chat Memory Full multi-turn conversation history flows into every generation call. Ask follow-ups naturally; the model remembers what you discussed. |
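
The semantic cache idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the `SemanticCache` class, its `get`/`put` methods, and the toy 3-dimensional embeddings are hypothetical; only the cosine-similarity comparison and the 0.92 threshold come from the description above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache keyed by query embedding instead of exact text."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self._entries = []  # list of (embedding, answer) pairs

    def put(self, query_embedding, answer):
        self._entries.append((list(query_embedding), answer))

    def get(self, query_embedding):
        """Return the closest cached answer, or None if below the threshold."""
        best_score, best_answer = 0.0, None
        for embedding, answer in self._entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache()
cache.put([1.0, 0.0, 0.0], "cached answer")
hit = cache.get([0.99, 0.05, 0.0])   # near-duplicate query
miss = cache.get([0.0, 1.0, 0.0])    # unrelated query
```

A linear scan is fine for a small cache; a larger one would likely reuse a vector index for the lookup.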
    Upload (PDF / DOCX / TXT)
            │
            ├── Chunk documents
            │
            └── [Contextual Retrieval ON]──► LLM enriches each chunk with surrounding context
                          │
                          ▼
            ┌──────────────────────────────────┐
            │     BM25  ·  FAISS  ·  Graph     │  ← three indexes built
            └──────────────────────────────────┘
                          │
                    Query arrives
                          │
          ┌───────────────┴───────────────┐
          ▼                               ▼
    💾 Semantic Cache?              🔀 RAG-Fusion
    ┌── HIT → return instantly      multi-query expansion
    │   MISS ↓                            │
    │                               RRF merge of results
    │                               + GraphRAG entity boost
    │                                     │
    │                               ⚡ Neural Rerank (CrossEncoder)
    │                                     │
    │                               ✅ CRAG: grade each chunk
    │                                  drop irrelevant ones
    │                                     │
    │                               🧠 LLM stream
    │                               <think> panel live
    │                                     │
    └─────────────────────────────► Answer + Source cards
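
The query path in the diagram can be approximated by the sketch below. Every stage is a caller-supplied stand-in: `run_pipeline`, the word-overlap scorer, the exact-match dictionary used in place of the semantic cache, and the toy corpus are all assumptions for illustration. The real project uses embeddings, a Cross-Encoder, and an LLM grader at these points.

```python
def run_pipeline(query, cache, retrieve, rerank_score, is_relevant, generate, top_k=3):
    """Follow the diagram: cache check, retrieve, rerank, grade, generate."""
    if query in cache:                      # stand-in for the semantic cache
        return cache[query]
    candidates = retrieve(query)            # stand-in for fused retrieval
    ranked = sorted(
        candidates, key=lambda chunk: rerank_score(query, chunk), reverse=True
    )[:top_k]
    kept = [chunk for chunk in ranked if is_relevant(query, chunk)]  # CRAG-style grading
    answer = generate(query, kept)
    cache[query] = answer
    return answer


def overlap(query, chunk):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


corpus = [
    "FAISS stores dense vectors",
    "BM25 ranks by term frequency",
    "Bananas are yellow",
]
cache = {}
answer = run_pipeline(
    "how does BM25 rank",
    cache,
    retrieve=lambda q: list(corpus),
    rerank_score=overlap,
    is_relevant=lambda q, c: overlap(q, c) > 0,
    generate=lambda q, chunks: " | ".join(chunks),
)
print(answer)
```

Passing the stages in as functions keeps the control flow visible and lets each one be swapped for a real component independently.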
Before you begin:
1 — Clone
git clone https://github.com/SaiAkhil066/CORTEX-AI-SUPER-RAG.git
cd CORTEX-AI-SUPER-RAG
2 — Install
pip install -r requirements.txt
Windows only: if you get a c10.dll DLL error on first run, pin PyTorch to the stable CPU build:
pip uninstall torch -y
pip install "torch==2.1.2" --index-url https://download.pytorch.org/whl/cpu
3 — Pull models
ollama pull llama3.1:8b # LLM (swap for any model you prefer)
ollama pull nomic-embed-text # Embeddings (required)
4 — Run
python -m streamlit run app.py
Open http://localhost:8501
Use python -m streamlit run (not bare streamlit run) to ensure the correct Python environment is picked up.
The model selector in the sidebar auto-populates from your locally installed Ollama models. Swap freely — no config change needed.
| Model | Params | Speed | Notes |
|---|---|---|---|
| llama3.1:8b | 8B | ⚡⚡⚡ | Default · best all-round balance |
| qwen2.5:7b | 7B | ⚡⚡⚡ | Strong on multilingual documents |
| mistral:7b | 7B | ⚡⚡⚡ | Fast, great for long documents |
| llama3.1:70b | 70B | ⚡ | Best quality when speed isn't a priority |
| qwen2.5-coder:7b | 7B | ⚡⚡⚡ | Best for code / technical docs |
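
One plausible way to discover locally installed models is Ollama's `/api/tags` endpoint, which returns a JSON object with a `models` list. The sketch below is an assumption about how such a selector could be populated, not a description of this app's code; the helper names `parse_model_names` and `list_local_models` are hypothetical, and the base URL assumes Ollama's default port.

```python
import json
import urllib.request


def parse_model_names(tags_payload):
    """Extract sorted model names from an /api/tags style response."""
    return sorted(model["name"] for model in tags_payload.get("models", []))


def list_local_models(base_url="http://localhost:11434"):
    """Query a running Ollama server for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return parse_model_names(json.load(response))


# Offline example using a hand-written payload of the same shape.
sample = {"models": [{"name": "nomic-embed-text:latest"}, {"name": "llama3.1:8b"}]}
names = parse_model_names(sample)
print(names)
```

Calling `list_local_models()` requires Ollama to be running; the parsing step is kept separate so it can be exercised without a server.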
🐳 Docker setup
Option A — Ollama on host (recommended)
docker-compose build && docker-compose up
Ollama runs natively; the container connects via the host network.
Option B — Everything in Docker
version: "3.8"
services:
  ollama:
    image: ghcr.io/jmorganca/ollama:latest
    ports:
      - "11434:11434"
  cortex-rag-service:
    build: .
    ports:
      - "8501:8501"
    environment:
      - OLLAMA_API_URL=http://ollama:11434
      - MODEL=llama3.1:8b
      - EMBEDDINGS_MODEL=nomic-embed-text:latest
      - CROSS_ENCODER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
    depends_on:
      - ollama
docker-compose up
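
The environment variables in the compose file above could be read on the Python side roughly as follows. This is a sketch under stated assumptions: the `load_settings` helper and the fallback values in `DEFAULTS` are hypothetical (the variable names come from the compose file, the defaults do not), and the app's actual configuration code may differ.

```python
import os

# Fallbacks are assumptions for local runs; only the keys come from the compose file.
DEFAULTS = {
    "OLLAMA_API_URL": "http://localhost:11434",
    "MODEL": "llama3.1:8b",
    "EMBEDDINGS_MODEL": "nomic-embed-text:latest",
    "CROSS_ENCODER_MODEL": "cross-encoder/ms-marco-MiniLM-L-6-v2",
}


def load_settings(env=None):
    """Return settings from the environment, falling back to DEFAULTS."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


# Simulate the container environment where only the Ollama URL is overridden.
settings = load_settings({"OLLAMA_API_URL": "http://ollama:11434"})
print(settings["OLLAMA_API_URL"], settings["MODEL"])
```

Inside the compose network the service name `ollama` resolves to the Ollama container, which is why the URL differs from the localhost default.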
| Component | Technology | Component | Technology |
|---|---|---|---|
| UI | Streamlit 1.30 | LLM inference | Ollama (local) |
| Vector store | FAISS | Sparse retrieval | BM25 (rank-bm25) |
| Knowledge graph | NetworkX | Neural reranker | sentence-transformers CrossEncoder |
| Embeddings | nomic-embed-text via Ollama | RAG orchestration | LangChain + langchain-classic |
| Document loading | PyMuPDF · Docx2txt · TextLoader | Supported files | PDF · DOCX · TXT · MD |
Built with curiosity · runs on your machine · owned by you
Reddit · Issues · Pull Requests
The future of retrieval-augmented AI is local — no internet required.
If Cortex RAG saved you time, consider buying us a coffee ☕
Every contribution keeps this project free and open-source.