AI Infrastructure

Real-time Vector Ingestion via Kafka

Apr 5, 2026
15 min read

How we achieved sub-100ms latency for global embedding updates.

Traditional RAG pipelines suffer from "intelligence lag." To enable truly live AI, we built a streaming pipeline using Apache Kafka and Flink to achieve sub-100ms end-to-end latency. This guide walks through the architecture, code, and tuning required to build it.

1. Architecture: The Streaming Shift

We moved away from batch ETL to a continuous event-driven model.

  1. Source: CDC or user events trigger Kafka messages.
  2. Transformation: Flink SQL generates embeddings in-flight.
  3. Sink: A high-performance consumer upserts vectors into an HNSW-indexed store (sketched below).
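
Stages 1 and 2 get dedicated sections below; as a concrete reference for stage 3, a minimal sink sketch might look like the following. It assumes the kafka-python and qdrant-client libraries, the vector-embeddings topic produced by the Flink job later in this guide, and an illustrative documents collection with UUID point ids; adapt it to your own store and schema.

python
import json

from kafka import KafkaConsumer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Consume the embedding events emitted by the transformation stage.
consumer = KafkaConsumer(
    'vector-embeddings',
    bootstrap_servers=['broker:9092'],
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)
store = QdrantClient(url='http://qdrant:6333')

for message in consumer:
    event = message.value  # e.g. {"id": "<uuid>", "embedding": [...]}
    store.upsert(
        collection_name='documents',                 # illustrative collection
        points=[PointStruct(id=event['id'],          # Qdrant ids are UUIDs or ints
                            vector=event['embedding'])],
    )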

2. The Kafka Producer: Tuning for Speed

To hit sub-100ms targets, we tuned every layer of the stack:

The Producer: Immediate Dispatch

The first bottleneck is "time-to-broker." Default producer settings favor throughput (batching); we must pivot to latency-first.

python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['broker:9092'],
    acks=1,                  # Immediate ack from leader
    linger_ms=0,             # No waiting to batch; send instantly
    compression_type=None,   # Zero CPU overhead for small payloads
    buffer_memory=67108864   # 64MB to prevent blocking
)

3. Flink SQL: In-Flight Embedding Generation

Use Flink SQL to treat vector generation as a continuous table transformation. This eliminates the need for manual "Fetch-Embed-Upload" loops.

sql
-- 1. Reference the Embedding Model
CREATE MODEL `text_embedder`
INPUT (text STRING) OUTPUT (embedding ARRAY<FLOAT>)
WITH ('TASK' = 'embedding', 'PROVIDER' = 'openai');

-- 2. Define the sink topic read by the vector store writer (e.g., Qdrant or Milvus)
CREATE TABLE vector_sink (
  id STRING, 
  embedding ARRAY<FLOAT>
) WITH ('connector' = 'kafka', 'topic' = 'vector-embeddings', 'format' = 'json');

-- 3. Execute Continuous Ingestion
INSERT INTO vector_sink
SELECT id, AI_EMBEDDING('text_embedder', content) 
FROM source_kafka_topic;

4. Adaptive Batching for P99 Resilience

Inference latency is volatile. An adaptive batching consumer monitors P99 processing time and shrinks batch sizes when the model slows down, keeping the pipeline within the 100ms goal.

python
# Inside the consumer loop: duration_ms is the wall-clock time
# spent processing the previous batch.
if duration_ms > 100.0:
    # Backpressure: shrink the batch to get back under the 100ms target
    self.current_batch_limit = max(1, int(self.current_batch_limit * 0.8))
elif duration_ms < 50.0:
    # Efficiency: grow by ~10% if we have headroom, always by at least 1
    self.current_batch_limit = min(500, max(self.current_batch_limit + 1,
                                            int(self.current_batch_limit * 1.1)))
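
Wired into a consumer loop, the full pattern might look like the rough sketch below. It assumes kafka-python and a hypothetical process_batch() callable that embeds and/or upserts a batch of records; the 100ms/50ms thresholds and 0.8/1.1 factors mirror the snippet above.

python
import time

from kafka import KafkaConsumer


class AdaptiveBatchConsumer:
    def __init__(self, topic, process_batch, initial_limit=100):
        self.consumer = KafkaConsumer(topic, bootstrap_servers=['broker:9092'])
        self.process_batch = process_batch      # hypothetical: embed and/or upsert records
        self.current_batch_limit = initial_limit

    def run(self):
        while True:
            # Pull at most current_batch_limit records per cycle.
            polled = self.consumer.poll(timeout_ms=10, max_records=self.current_batch_limit)
            records = [r for batch in polled.values() for r in batch]
            if not records:
                continue

            start = time.perf_counter()
            self.process_batch([r.value for r in records])
            duration_ms = (time.perf_counter() - start) * 1000

            if duration_ms > 100.0:
                # Backpressure: shrink the batch to get back under the 100ms target
                self.current_batch_limit = max(1, int(self.current_batch_limit * 0.8))
            elif duration_ms < 50.0:
                # Headroom: grow by ~10%, always by at least one record
                self.current_batch_limit = min(500, max(self.current_batch_limit + 1,
                                                        int(self.current_batch_limit * 1.1)))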

5. Tune Indexing (HNSW)

For real-time writes, use the HNSW (Hierarchical Navigable Small World) algorithm, which supports near-instant incremental indexing. Two parameters dominate the speed/recall trade-off (a configuration sketch follows the list):

  • M (Links per node): Set to 16 for a balance of speed and memory.
  • ef_construction: Set to 128. Higher values improve recall but increase the P99 write latency beyond our 100ms budget.
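
In Qdrant, for example, both knobs are set when the collection is created. A minimal sketch, assuming the qdrant-client library, 1536-dimensional embeddings, and cosine distance (the collection name is illustrative):

python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = QdrantClient(url='http://qdrant:6333')

client.create_collection(
    collection_name='documents',
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,              # links per node: speed/memory balance
        ef_construct=128,  # build-time beam width: recall vs. P99 write latency
    ),
)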

6. Infrastructure, CI/CD & Monitoring

Deploy using the Strimzi Operator on Kubernetes. Use Node Affinity to keep your consumers and inference engines on the same physical node to eliminate network hops.

Automated Deployment (GitHub Actions):

yml
name: Deploy Real-time Vector Stack
on:
  push:
    branches: [ main ]
    paths:
      - 'k8s/**'  # Only trigger on manifest changes

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Set up Kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}

      - name: Deploy Kafka (Strimzi)
        run: |
          # Apply the Kafka Cluster and Topics
          kubectl apply -f k8s/kafka/cluster.yaml
          kubectl apply -f k8s/kafka/topics.yaml
          # Wait for cluster readiness
          kubectl wait kafka/vector-cluster --for=condition=Ready --timeout=300s

      - name: Deploy Flink SQL Job
        run: |
          # Apply the FlinkDeployment Custom Resource
          kubectl apply -f k8s/flink/vector-ingestion-job.yaml

The Monitoring North Star:

Use Prometheus and Grafana to track the Inverse Correlation (a minimal exporter sketch follows this list):

  • When P99 Latency spikes → Batch Size should drop.
  • When Latency recovers → Throughput should rise.
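
Both sides of that correlation can be exported from the consumer itself. A rough sketch using the prometheus_client library (metric names are illustrative):

python
from prometheus_client import Gauge, Histogram, start_http_server

BATCH_LIMIT = Gauge('vector_consumer_batch_limit',
                    'Current adaptive batch size limit')
BATCH_LATENCY = Histogram('vector_consumer_batch_seconds',
                          'Per-batch processing time',
                          buckets=(0.025, 0.05, 0.1, 0.25, 0.5, 1.0))

start_http_server(9000)  # expose /metrics on :9000 for Prometheus to scrape

# Inside the adaptive consumer loop:
#   BATCH_LATENCY.observe(duration_ms / 1000)
#   BATCH_LIMIT.set(self.current_batch_limit)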

Final Performance & Resource Breakdown

Stage                | Latency | Resource        | Primary Bottleneck
Kafka Broker Transit | ~15ms   | Low (I/O Bound) | Disk Throughput
Model Inference      | ~50ms   | High (GPU/CPU)  | Model Quantization
Vector DB Indexing   | ~20ms   | Medium (RAM)    | HNSW Graph Depth
Total E2E            | ~85ms   | Optimized       | End-to-End Sync

Conclusion: Why Real-time Vectors Matter

Building a sub-100ms vector ingestion pipeline is no longer just a "nice-to-have" feature; it is the foundation for Adaptive AI. By moving from batch processing to a Kafka-driven streaming architecture, we’ve eliminated the intelligence gap that plagues traditional RAG systems.

Key Takeaways:

  • Speed is a Feature: Sub-100ms latency allows AI to react to user behavior within the same session.
  • Resilience via Adaptation: Using adaptive batching ensures that even when external APIs slow down, your system stays responsive.
  • Infrastructure as Code: Leveraging Strimzi and GitHub Actions makes this high-performance stack maintainable and scalable.

By following this step-by-step guide—from producer tuning to Kubernetes deployment—you can transform your static data into a living, breathing vector stream that powers the next generation of AI applications.

Ready to benchmark your own pipeline?