Goal: Sub-100ms vector updates with an 80% cost reduction via Semantic Caching.
This Real-time Vector RAG project guide is a production-grade blueprint for building a low-latency vector ingestion pipeline with Apache Kafka, Flink SQL, and Qdrant. It features a Semantic Cache that reduces LLM API costs by up to 80% and an automated Budget Guard that prevents runaway automation costs.
Architecture
- Ingestion: Kafka producers optimized with `linger.ms=0` for instant dispatch.
- Transformation: Flink SQL treats vector generation as a continuous stream.
- Storage: Qdrant with HNSW Indexing for <20ms vector lookups.
- Semantic Layer: A Python-based cache that intercepts redundant queries.
- Monitoring: Prometheus & Grafana dashboard tracking P99 Latency and ROI.
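The latency-oriented producer settings above can be sketched as a small config helper for `kafka-python`. This is a minimal sketch; the bootstrap address, topic name, and `acks` choice are assumptions to adapt:

```python
import json


def producer_config(bootstrap="localhost:9092"):
    """Config for a latency-optimized kafka-python KafkaProducer.

    linger_ms=0 disables the batching delay so every record is
    dispatched immediately, trading throughput for minimal latency.
    """
    return {
        "bootstrap_servers": bootstrap,
        "linger_ms": 0,
        "acks": 1,  # leader-only ack keeps the produce round-trip short
        "value_serializer": lambda v: json.dumps(v).encode("utf-8"),
    }


# Usage (requires a running broker):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(**producer_config())
#   producer.send("documents", {"id": 1, "text": "hello"})
```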
The repository includes the complete local setup, the Semantic Cache logic with Budget Guard protection, and an automated midnight reset to keep costs under control.
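The Flink SQL job (`infra/flink-job.sql`) is referenced but not shown in full. A hedged sketch, assuming a JSON Kafka source table and a hypothetical `EMBED()` user-defined function for vector generation (all table names, topics, and connector options below are assumptions):

```sql
-- Assumed source: raw documents arriving on a Kafka topic
CREATE TABLE raw_docs (
    id   STRING,
    body STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'documents',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);

-- EMBED() is a hypothetical UDF wrapping the embedding model;
-- vector generation runs as a continuous stream over the topic.
INSERT INTO doc_embeddings
SELECT id, EMBED(body) FROM raw_docs;
```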
1. Project Directory Structure
```
/vector-stream-rag
├── /apps
│   ├── /web               # React/Next.js (frontend)
│   └── /ingestion         # Python (Kafka ingestion + cache logic)
├── /infra
│   ├── docker-compose.yml # Kafka, Qdrant, Prometheus, Grafana
│   ├── prometheus.yml     # Scrape configuration
│   └── flink-job.sql      # Streaming SQL
├── /scripts
│   └── reset_budget.py    # Daily credit reset
├── .env                   # OpenAI & Kafka keys
└── README.md
```
2. Installation
Step 1: Launch the Infrastructure

Create `infra/docker-compose.yml` and run:

```bash
docker-compose up -d
```
This starts the message broker, vector database, and monitoring stack.
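A hedged sketch of what `infra/docker-compose.yml` might contain. The image tags, ports, and the Bitnami Kafka KRaft environment variables are all assumptions to verify against the images' own documentation:

```yaml
services:
  kafka:
    image: bitnami/kafka:latest
    ports: ["9092:9092"]
    environment:  # minimal single-node KRaft setup (assumed values)
      - KAFKA_CFG_NODE_ID=0
      - KAFKA_CFG_PROCESS_ROLES=controller,broker
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```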
Step 2: Set Up the Python Ingestion App

Install dependencies and initialize the Budget Guard, which prevents the automated cache-warmer from spending more than the daily limit (e.g., $5.00):

```bash
pip install kafka-python prometheus_client qdrant-client openai
```
3. Core Logic: Semantic Cache & Budget Guard
This script (`apps/ingestion/processor.py`) sits in the Kafka consumer loop. It checks the semantic cache first; only when it must call the LLM does it verify the budget.
```python
# apps/ingestion/processor.py
# search_vector_db, call_openai, and update_usage are defined
# elsewhere in the repo (Qdrant lookup, OpenAI client, usage writer).
import json
import os

from prometheus_client import Counter, start_http_server

BUDGET_FILE = "../../usage_tracker.json"
LIMIT = 5.00  # $5.00 daily limit

CACHE_HITS = Counter("semantic_cache_hits_total", "Queries served from the cache")
LLM_CALLS = Counter("llm_calls_total", "Paid LLM API calls")


def is_budget_safe():
    """Return True while today's recorded spend is under the daily limit."""
    if not os.path.exists(BUDGET_FILE):
        return True
    with open(BUDGET_FILE, "r") as f:
        return json.load(f).get("spent", 0) < LIMIT


def process_query(query):
    # 1. Semantic cache lookup (<20ms)
    hit = search_vector_db(query)
    if hit:
        CACHE_HITS.inc()
        return hit

    # 2. Safety check before the paid LLM call
    if not is_budget_safe():
        return "Budget exceeded. Serving stale data."

    # 3. LLM call, then record the spend
    response = call_openai(query)
    LLM_CALLS.inc()
    update_usage(0.01)  # record $0.01 per call
    return response


if __name__ == "__main__":
    start_http_server(8000)  # expose metrics for Prometheus to scrape
```
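The cache-lookup step above hinges on a similarity threshold: a stored answer is reused only if its embedding is "close enough" to the query's. A dependency-free sketch of that decision (the 0.9 threshold and toy 2-D vectors are illustrative assumptions; in production the nearest-neighbor search is Qdrant's HNSW index):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def cache_lookup(query_vec, cache, threshold=0.9):
    """Return the cached answer most similar to the query,
    but only if it clears the similarity threshold."""
    best_score, best_answer = -1.0, None
    for vec, answer in cache:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None
```

With a threshold of 0.9, close paraphrases of a cached question hit the cache, while genuinely new intents fall through to the LLM call.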
4. Automated Midnight Reset (Cron Job)
To ensure your budget clears every night, use a Cron job to run a reset script.
The Reset Script (`scripts/reset_budget.py`):

```python
import json

with open("usage_tracker.json", "w") as f:
    json.dump({"spent": 0.0}, f)

print("📅 Budget reset for the new day.")
```
- Setting the Cron Job: open your terminal and run `crontab -e`, then add this line to run the reset every night at midnight:

```bash
0 0 * * * /usr/bin/python3 /path/to/scripts/reset_budget.py
```
Local Setup
1. Launch Infrastructure
Start the Kafka broker, Vector DB, and Monitoring stack:
```bash
docker-compose -f infra/docker-compose.yml up -d
```
2. Set Up the Ingestion Engine

Install dependencies and start the Python consumer:

```bash
cd apps/ingestion
pip install -r requirements.txt
python processor.py
```
🛡️ The "Budget Guard" Feature
To prevent automated cache warming or bursts of LLM calls from exceeding your budget, the system includes a persistent safety switch.
- Limit: $5.00/day (configurable in `processor.py`).
- Safety: If the limit is reached, the system falls back to "Stale Mode" and alerts the team via Grafana.
- Auto-Reset: A midnight cron job clears the usage daily:

```bash
0 0 * * * /usr/bin/python3 scripts/reset_budget.py
```
Tech Stack
- Streaming: Apache Kafka (Bitnami)
- Vector DB: Qdrant
- Transformation: Flink SQL
- Frontend: React/Next.js 14, Tailwind CSS
- Monitoring: Prometheus, Grafana
Is your cache serving the right intent?