Goal: Sub-100ms vector updates with an 80% cost reduction via Semantic Caching.
This Real-time Vector RAG project guide is a production-grade blueprint for building a low-latency vector ingestion pipeline with Apache Kafka, Flink SQL, and Qdrant. It features a Semantic Cache that reduces LLM API costs by up to 80% and an automated Budget Guard that prevents runaway automation costs.
Architecture
- Ingestion: Kafka producers optimized with `linger.ms=0` for instant dispatch.
- Transformation: Flink SQL treats vector generation as a continuous stream.
- Storage: Qdrant with HNSW Indexing for <20ms vector lookups.
- Semantic Layer: A Python-based cache that intercepts redundant queries.
- Monitoring: Prometheus & Grafana dashboard tracking P99 Latency and ROI.
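The latency-oriented producer settings above can be sketched as a small config helper for `kafka-python`. This is a minimal sketch; the bootstrap address, topic name, and `acks` choice are assumptions to adapt:

```python
import json


def producer_config(bootstrap="localhost:9092"):
    """Config for a latency-optimized kafka-python KafkaProducer.

    linger_ms=0 disables the batching delay so every record is
    dispatched immediately, trading throughput for minimal latency.
    """
    return {
        "bootstrap_servers": bootstrap,
        "linger_ms": 0,
        "acks": 1,  # leader-only ack keeps the produce round-trip short
        "value_serializer": lambda v: json.dumps(v).encode("utf-8"),
    }


# Usage (requires a running broker):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(**producer_config())
#   producer.send("documents", {"id": 1, "text": "hello"})
```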
The repository includes the complete local setup, the Semantic Cache logic with Budget Guard protection, and an automated midnight reset to keep costs under control.
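The Flink SQL job (`infra/flink-job.sql`) is referenced but not shown in full. A hedged sketch, assuming a JSON Kafka source table and a hypothetical `EMBED()` user-defined function for vector generation (all table names, topics, and connector options below are assumptions):

```sql
-- Assumed source: raw documents arriving on a Kafka topic
CREATE TABLE raw_docs (
    id   STRING,
    body STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'documents',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);

-- EMBED() is a hypothetical UDF wrapping the embedding model;
-- vector generation runs as a continuous stream over the topic.
INSERT INTO doc_embeddings
SELECT id, EMBED(body) FROM raw_docs;
```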
1. Project Directory Structure
```
/vector-stream-rag
├── /apps
│   ├── /web               # React/Next.js (frontend)
│   └── /ingestion         # Python (Kafka ingestion + cache logic)
├── /infra
│   ├── docker-compose.yml # Kafka, Qdrant, Prometheus, Grafana
│   ├── prometheus.yml     # Scrape configuration
│   └── flink-job.sql      # Streaming SQL
├── /scripts
│   └── reset_budget.py    # Daily credit reset
├── .env                   # OpenAI & Kafka keys
└── README.md
```
2. Installation
Step 1: Launch the Infrastructure

Create `infra/docker-compose.yml` and run:

```bash
docker-compose up -d
```
This starts the message broker, vector database, and monitoring stack.
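A hedged sketch of what `infra/docker-compose.yml` might contain. The image tags, ports, and the Bitnami Kafka KRaft environment variables are all assumptions to verify against the images' own documentation:

```yaml
services:
  kafka:
    image: bitnami/kafka:latest
    ports: ["9092:9092"]
    environment:  # minimal single-node KRaft setup (assumed values)
      - KAFKA_CFG_NODE_ID=0
      - KAFKA_CFG_PROCESS_ROLES=controller,broker
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```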
Step 2: Set Up the Python Ingestion App

Install dependencies and initialize the Budget Guard, which prevents the automated cache-warmer from spending more than the daily limit (e.g., $5.00):

```bash
pip install kafka-python prometheus_client qdrant-client openai
```
3. Core Logic: Semantic Cache & Budget Guard
This script (`apps/ingestion/processor.py`) sits in the Kafka consumer loop. It checks the semantic cache first; only when it must call the LLM does it verify the budget.
```python
# apps/ingestion/processor.py
# search_vector_db, call_openai, and update_usage are defined
# elsewhere in the repo (Qdrant lookup, OpenAI client, usage writer).
import json
import os

from prometheus_client import Counter, start_http_server

BUDGET_FILE = "../../usage_tracker.json"
LIMIT = 5.00  # $5.00 daily limit

CACHE_HITS = Counter("semantic_cache_hits_total", "Queries served from the cache")
LLM_CALLS = Counter("llm_calls_total", "Paid LLM API calls")


def is_budget_safe():
    """Return True while today's recorded spend is under the daily limit."""
    if not os.path.exists(BUDGET_FILE):
        return True
    with open(BUDGET_FILE, "r") as f:
        return json.load(f).get("spent", 0) < LIMIT


def process_query(query):
    # 1. Semantic cache lookup (<20ms)
    hit = search_vector_db(query)
    if hit:
        CACHE_HITS.inc()
        return hit

    # 2. Safety check before the paid LLM call
    if not is_budget_safe():
        return "Budget exceeded. Serving stale data."

    # 3. LLM call, then record the spend
    response = call_openai(query)
    LLM_CALLS.inc()
    update_usage(0.01)  # record $0.01 per call
    return response


if __name__ == "__main__":
    start_http_server(8000)  # expose metrics for Prometheus to scrape
```
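The cache-lookup step above hinges on a similarity threshold: a stored answer is reused only if its embedding is "close enough" to the query's. A dependency-free sketch of that decision (the 0.9 threshold and toy 2-D vectors are illustrative assumptions; in production the nearest-neighbor search is Qdrant's HNSW index):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def cache_lookup(query_vec, cache, threshold=0.9):
    """Return the cached answer most similar to the query,
    but only if it clears the similarity threshold."""
    best_score, best_answer = -1.0, None
    for vec, answer in cache:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None
```

With a threshold of 0.9, close paraphrases of a cached question hit the cache, while genuinely new intents fall through to the LLM call.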
4. Automated Midnight Reset (Cron Job)
To ensure your budget clears every night, use a Cron job to run a reset script.
The Reset Script (`scripts/reset_budget.py`):

```python
import json

with open("usage_tracker.json", "w") as f:
    json.dump({"spent": 0.0}, f)

print("📅 Budget reset for the new day.")
```
- Setting the Cron Job: open your terminal and run `crontab -e`, then add this line to run the reset every night at midnight:

```bash
0 0 * * * /usr/bin/python3 /path/to/scripts/reset_budget.py
```
Local Setup
1. Launch Infrastructure
Start the Kafka broker, Vector DB, and Monitoring stack:
```bash
docker-compose -f infra/docker-compose.yml up -d
```
2. Set Up the Ingestion Engine

Install dependencies and start the Python consumer:

```bash
cd apps/ingestion
pip install -r requirements.txt
python processor.py
```
🛡️ The "Budget Guard" Feature
To prevent automated cache warming or bursts of LLM calls from exceeding your budget, the system includes a persistent safety switch.
- Limit: $5.00/day (configurable in `processor.py`).
- Safety: If the limit is reached, the system falls back to "Stale Mode" and alerts the team via Grafana.
- Auto-Reset: A midnight cron job clears the usage daily:

```bash
0 0 * * * /usr/bin/python3 scripts/reset_budget.py
```
Tech Stack
- Streaming: Apache Kafka (Bitnami)
- Vector DB: Qdrant
- Transformation: Flink SQL
- Frontend: React/Next.js 14, Tailwind CSS
- Monitoring: Prometheus, Grafana
Is your cache serving the right intent?