Learn to build modern AI systems from the ground up through hands-on implementation
Master the most in-demand AI engineering skills: RAG (Retrieval-Augmented Generation)
This is a learner-focused project where you'll build a complete research assistant system that automatically fetches academic papers, understands their content, and answers your research questions using advanced RAG techniques.
The arXiv Paper Curator will teach you to build a production-grade RAG system using industry best practices. Unlike tutorials that jump straight to vector search, we follow the professional path: master keyword search foundations first, then enhance with vectors for hybrid retrieval.
🎯 The Professional Difference: We build RAG systems the way successful companies do - solid search foundations enhanced with AI, not AI-first approaches that ignore search fundamentals.
By the end of this course, you'll have your own AI research assistant and the deep technical skills to build production RAG systems for any domain.

Complete Week 7 architecture showing Telegram bot integration with the agentic RAG system

Detailed LangGraph workflow showing decision nodes, document grading, and adaptive retrieval
Agentic RAG Workflow:
User Query → Guardrail Node → [PROCEED or OUT_OF_SCOPE]
↓
Retrieve Node (attempt 1)
↓
Grade Documents → [RELEVANT or INSUFFICIENT]
↓
[If INSUFFICIENT] → Rewrite Query → Retrieve Node (attempt 2)
↓
Generate Answer Node → Final Response with Citations
Key Innovations in Week 7:
- 🤖 Intelligent Decision-Making: Agents evaluate and adapt retrieval strategies
- 🔍 Document Grading: Automatic relevance assessment with semantic evaluation
- 🔄 Query Rewriting: Adaptive query refinement when results are insufficient
- 🛡️ Guardrails: Out-of-domain detection prevents hallucination
- 📱 Mobile Access: Telegram bot for conversational AI on any device
- 🔎 Transparency: Full reasoning step tracking for debugging and trust
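The workflow above can be sketched in plain Python to make the control flow concrete. This is a minimal illustration, not the course's LangGraph implementation: `run_agentic_rag` and its callable parameters (`in_scope`, `retrieve`, `is_relevant`, `rewrite`, `generate`) are hypothetical names, and each node is assumed to be a simple function you supply.

```python
from typing import Callable


def run_agentic_rag(
    query: str,
    in_scope: Callable[[str], bool],
    retrieve: Callable[[str], list[str]],
    is_relevant: Callable[[str, list[str]], bool],
    rewrite: Callable[[str], str],
    generate: Callable[[str, list[str]], str],
) -> tuple[str, list[str]]:
    """Run guardrail -> retrieve -> grade -> (rewrite -> retrieve) -> generate.

    Returns the answer plus a list of reasoning steps for transparency.
    All node functions are hypothetical stand-ins supplied by the caller.
    """
    steps: list[str] = []

    # Guardrail node: stop early for out-of-domain questions.
    if not in_scope(query):
        steps.append("guardrail: OUT_OF_SCOPE")
        return "This question is outside the scope of the indexed papers.", steps
    steps.append("guardrail: PROCEED")

    # Retrieve node, attempt 1.
    docs = retrieve(query)
    steps.append("retrieve: attempt 1")

    # Grade documents node.
    if is_relevant(query, docs):
        steps.append("grade: RELEVANT")
    else:
        steps.append("grade: INSUFFICIENT")
        # Rewrite the query once and retrieve again (attempt 2).
        query = rewrite(query)
        steps.append("rewrite: query refined")
        docs = retrieve(query)
        steps.append("retrieve: attempt 2")

    # Generate answer node.
    answer = generate(query, docs)
    steps.append("generate: answer")
    return answer, steps
```

Passing the nodes in as functions keeps the sketch testable with stubs; in a graph framework the same branches would become conditional edges between nodes.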
# 1. Clone and setup
git clone <repository-url>
cd arxiv-paper-curator
# 2. Configure environment (IMPORTANT!)
cp .env.example .env
# The .env file contains all necessary configuration for OpenSearch,
# arXiv API, and service connections. Defaults work out of the box.
# For Week 4: Add JINA_API_KEY=your_key_here for hybrid search
# 3. Install dependencies
uv sync
# 4. Start all services
docker compose up --build -d
# 5. Verify everything works
curl http://localhost:8000/health
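Containers can take a little while to come up after `docker compose up`, so a single `curl` may fail on the first try. As a hedged alternative, the sketch below polls the same health URL until it answers. The helper names `http_ok` and `wait_until_healthy`, the retry count, and the delay are illustrative assumptions, not part of the project.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET request to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses, as are refused
        # or reset connections while the service is still starting.
        return False


def wait_until_healthy(
    url: str,
    check: Callable[[str], bool] = http_ok,
    retries: int = 10,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` up to `retries` times, pausing `delay` seconds between tries."""
    for attempt in range(1, retries + 1):
        if check(url):
            return True
        if attempt < retries:
            sleep(delay)
    return False


# Example (assumes the API is published on localhost:8000 as shown above):
# wait_until_healthy("http://localhost:8000/health")
```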
| Week | Topic | Blog Post | Code Release |
|---|---|---|---|
| Week 0 | The Mother of AI project - 6 phases | The Mother of AI project | - |
| Week 1 | Infrastructure Foundation | The Infrastructure That Powers RAG Systems | week1.0 |
| Week 2 | Data Ingestion Pipeline | Building Data Ingestion Pipelines for RAG | week2.0 |
| Week 3 | OpenSearch ingestion & BM25 retrieval | The Search Foundation Every RAG System Needs | week3.0 |
| Week 4 | Chunking & Hybrid Search | The Chunking Strategy That Makes Hybrid Search Work | week4.0 |
| Week 5 | Complete RAG system | The Complete RAG System | week5.0 |
| Week 6 | Production monitoring & caching | Production-ready RAG: Monitoring & Caching | week6.0 |
| Week 7 | Agentic RAG & Telegram Bot | Agentic RAG with LangGraph and Telegram | week7.0 |
📥 Clone a specific week's release:
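# Clone a specific week's code (replace <WEEK_TAG> with week1.0, week2.0, etc.)
git clone --branch <WEEK_TAG> https://github.com/jamwithai/arxiv-paper-curator
cd arxiv-paper-curator
# Install dependencies for that release
uv sync
# Stop any running stack; -v also removes its volumes, so stored data is wiped
docker compose down -v
# Rebuild and start all services in the background
docker compose up --build -d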
| Service | URL | Purpose |
|---|---|---|
| API Documentation | http://localhost:8000/docs | Interactive API testing |
| Gradio RAG Interface | http://localhost:7861 | User-friendly chat interface |
| Langfuse Dashboard | http://localhost:3000 | RAG pipeline monitoring & tracing |
| Airflow Dashboard | http://localhost:8080 | Workflow management |
| OpenSearch Dashboards | http://localhost:5601 | Hybrid search engine UI |
Start here! Master the infrastructure that powers modern RAG systems.

Infrastructure Components:
- FastAPI: REST endpoints with async support (Port 8000)
- PostgreSQL 16: Paper metadata storage (Port 5432)
- OpenSearch 2.19: Search engine with dashboards (Ports 9200, 5601)
- Apache Airflow 3.0: Workflow orchestration (Port 8080)
- Ollama: Local LLM server (Port 11434)
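Not every component above speaks HTTP (PostgreSQL does not), so a TCP-level probe is a quick way to see which services are listening. The sketch below assumes all ports are published on `localhost`, which depends on your Docker Compose configuration; `SERVICE_PORTS`, `port_open`, and `check_services` are hypothetical names used for illustration.

```python
import socket
from typing import Callable

# Ports as listed above; assumed to be published on the host.
SERVICE_PORTS = {
    "FastAPI": 8000,
    "PostgreSQL": 5432,
    "OpenSearch": 9200,
    "OpenSearch Dashboards": 5601,
    "Airflow": 8080,
    "Ollama": 11434,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(
    host: str = "localhost",
    ports: dict[str, int] = SERVICE_PORTS,
    probe: Callable[[str, int], bool] = port_open,
) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: probe(host, port) for name, port in ports.items()}


# Example:
# for name, ok in check_services().items():
#     print(f"{name:<22} {'up' if ok else 'DOWN'}")
```

An open port only shows that something is listening; the Week 1 notebook remains the place for full verification.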
# Launch the Week 1 notebook
uv run jupyter notebook notebooks/week1/week1_setup.ipynb
Completion Guide: Follow the Week 1 notebook for hands-on setup and verification steps.
Blog Post: The Infrastructure That Powers RAG Systems - Detailed walkthrough and production insights
Building on Week 1 infrastructure: Learn to fetch, process, and store academic papers automatically.

Data Pipeline Components:
- MetadataFetcher: 🎯 Main orchestrator coordinating the entire pipeline
- ArxivClient: Rate-limited paper fetching with retry logic
- PDFParserService: Docling-powered scientific document processing
- Airflow DAGs: Automated daily paper ingestion workflows
- PostgreSQL Storage: Structured paper metadata and content
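To illustrate what "rate-limited paper fetching with retry logic" can look like, here is a minimal sketch. It is not the project's `ArxivClient`: the class name `RateLimitedFetcher`, the 3-second minimum interval, and the exponential backoff settings are assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimitedFetcher:
    """Space calls at least `min_interval` seconds apart and retry failures."""

    def __init__(
        self,
        min_interval: float = 3.0,   # assumed polite delay between requests
        max_retries: int = 3,        # retries after the first failed attempt
        backoff: float = 2.0,        # wait backoff**attempt seconds after a failure
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self.max_retries = max_retries
        self.backoff = backoff
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def _wait_for_slot(self) -> None:
        # Sleep just long enough to respect the minimum interval.
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()

    def call(self, fn: Callable[[], T]) -> T:
        """Run `fn` under the rate limit, retrying with exponential backoff."""
        last_error: Optional[Exception] = None
        for attempt in range(self.max_retries + 1):
            self._wait_for_slot()
            try:
                return fn()
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                if attempt < self.max_retries:
                    self._sleep(self.backoff ** attempt)
        raise RuntimeError(
            f"gave up after {self.max_retries + 1} attempts"
        ) from last_error
```

Usage would look like `fetcher.call(lambda: download(url))`, where `download` is whatever function performs the request. Injecting `clock` and `sleep` keeps the timing logic testable without real waiting.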
# Launch the Week 2 notebook
uv run jupyter notebook notebooks/week2/week2_arxiv_integration.ipynb
Completion Guide: Follow the Week 2 notebook for hands-on implementation and verification steps.
Blog Post: Building Data Ingestion Pipelines for RAG - arXiv API integration and PDF processing
Building on the Weeks 1-2 foundation: Implement the BM25 keyword search layer that professional RAG systems rely on.

Search Infrastructure Components:
- OpenSearch Service: src/services/opensearch/ - Professional search service implementation
- Search API: src/routers/search.py - Search API endpoints with BM25 scoring
- Learning Materials: notebooks/week3/ - Complete OpenSearch integration guide
- Quality Metrics: Precision, recall, and relevance scoring
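For intuition about what BM25 scoring computes, the following is a small pure-Python sketch of the textbook formula. It is only an illustration: in the course OpenSearch does the scoring, and its internal implementation may differ in constant factors and tokenization. The function name `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.2`, `b=0.75` are assumptions for this example.

```python
import math
from collections import Counter


def bm25_scores(
    query: str, docs: list[str], k1: float = 1.2, b: float = 0.75
) -> list[float]:
    """Score each document against the query with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            f = tf[term]
            if f == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b normalizes by document length.
            norm = f + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that matches more query terms, or matches rarer ones, scores higher, while very long documents are penalized through the length normalization term.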
# Launch the Week 3 notebook
uv run jupyter notebook notebooks/week3/week3_opensearch.ipynb
Completion Guide: Follow the Week 3 notebook for hands-on OpenSearch setup and BM25 search implementation.
Blog Post: The Search Foundation Every RAG System Needs - Complete BM25 implementation with OpenSearch
Building on the Week 3 foundation: Add a semantic layer (chunking plus vector embeddings) on top of BM25, so search matches meaning as well as keywords.

Hybrid Search Infrastructure Components:
- Text Chunker: src/services/indexing/text_chunker.py - Section-aware chunking with overlap strategies
- Embeddings Service: src/services/embeddings/ - Production embedding pipeline with Jina AI
- Hybrid Search API: src/routers/hybrid_search.py - Unified search API supporting all modes
- Learning Materials: notebooks/week4/ - Complete hybrid search implementation guide
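As a rough picture of section-aware chunking with overlap, the sketch below splits each section into fixed-size word windows that share a few words with their neighbor and never cross a section boundary. It is not the code in `text_chunker.py`: the function names, the word-based splitting, and the default sizes are assumptions made for illustration.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split `words` into windows of `chunk_size` sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def chunk_sections(
    sections: dict[str, str], chunk_size: int = 200, overlap: int = 50
) -> list[dict]:
    """Chunk each section separately so no chunk spans two sections."""
    result = []
    for title, text in sections.items():
        for index, piece in enumerate(chunk_words(text.split(), chunk_size, overlap)):
            result.append({"section": title, "index": index, "text": " ".join(piece)})
    return result
```

Keeping the section title on every chunk lets a retriever show where a passage came from, and the overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.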
# Launch the Week 4 notebook
uv run jupyter notebook notebooks/week4/week4_hybrid_search.ipynb
Completion Guide: Follow the Week 4 notebook for hands-on implementation and verification steps.
Blog Post: The Chunking Strategy That Makes Hybrid Search Work - Production chunking and RRF fusion implementation
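Reciprocal Rank Fusion (RRF) merges a keyword ranking and a vector ranking using only each document's rank position, so the two score scales never need to be comparable. Here is a minimal sketch of the idea; the function name `reciprocal_rank_fusion` and the constant `k=60` (a commonly used default) are assumptions, and the course's own fusion may be configured differently.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for stability.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Example with hypothetical document IDs:
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Documents that rank well in both lists ("b" and "a" here) rise to the top, while documents found by only one retriever are kept but ranked lower.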
Building on Week 4 hybrid search: Add the LLM layer that turns search into intelligent conversation.