Building Production-Ready RAG Systems with Elasticsearch as a Vector DB

Retrieval-Augmented Generation (RAG) has become the go-to architecture for building AI applications that need both the creativity of large language models and the accuracy of proprietary data. Here are the key lessons from implementing multiple RAG systems in production.

The Problem with Naive RAG

Many RAG implementations fail at scale because they treat retrieval as a single vector search. In practice, production RAG requires:

Multi-stage retrieval — not just embedding similarity, but hybrid search with keyword matching
Context window optimization — knowing what to include and what to filter
Re-ranking — post-retrieval scoring to improve relevance
Feedback loops — tracking which retrieved documents actually helped
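
As one illustration of context window optimization, here is a minimal sketch of a greedy packer that keeps the highest-scoring chunks that fit a token budget. The function names are hypothetical, and the whitespace-based token count is an assumption that stands in for your model's real tokenizer.

```python
from typing import Callable


def count_tokens_approx(text: str) -> int:
    """Rough token count by whitespace splitting (assumption; swap in a real tokenizer)."""
    return len(text.split())


def pack_context(
    chunks: list[tuple[str, float]],
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_approx,
) -> list[str]:
    """Greedily keep the highest-scoring (text, score) chunks that fit the budget."""
    selected: list[str] = []
    used = 0
    for text, _score in sorted(chunks, key=lambda chunk: chunk[1], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # this chunk would overflow; a smaller one may still fit
        selected.append(text)
        used += cost
    return selected
```

Greedy packing is simple and predictable, but it can skip a large, highly relevant chunk in favor of several small ones, so the budget and chunk sizes should be tuned together.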

Elasticsearch as a Vector DB

Elasticsearch offers several advantages over dedicated vector databases for RAG:

Hybrid search out of the box — combine dense vector similarity with BM25 keyword scoring
Filtering — apply metadata filters before, during, and after vector search
Ingestion pipelines — preprocess documents before embedding
Maturity — battle-tested at massive scale
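
To support both keyword and vector search on the same documents, the index needs a text field and a dense vector field side by side. The mapping below is a sketch under assumptions: the `content` field name, the `docs` index name, and the 384-dimension embedding size are all hypothetical and must match your own data and embedding model.

```python
# Hypothetical mapping: chunk text, its embedding, and filterable metadata.
EMBEDDING_DIMS = 384  # assumption: must equal your embedding model's output size

index_mappings = {
    "properties": {
        "content": {"type": "text"},  # analyzed text, scored with BM25
        "embedding": {
            "type": "dense_vector",
            "dims": EMBEDDING_DIMS,
            "index": True,  # make the field searchable with approximate kNN
            "similarity": "cosine",
        },
        "category": {"type": "keyword"},  # exact-match metadata filter
        "created_at": {"type": "date"},  # supports range filters
    }
}

# With the official Python client (8.x), this could be applied as:
#   es.indices.create(index="docs", mappings=index_mappings)
```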

Implementation Pattern

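# Hybrid search body: kNN vector search plus a BM25 keyword query
# (the "content" field and query_text are assumed names)
filters = [
    {"term": {"category": "technical"}},
    {"range": {"created_at": {"gte": "2023-01-01"}}}
]
query = {
    "knn": {
        "field": "embedding",
        "k": 50,
        "num_candidates": 100,
        "query_vector": query_embedding,
        "filter": filters  # pre-filter applied during the kNN search
    },
    "query": {
        "bool": {
            "must": {"match": {"content": query_text}},
            "filter": filters
        }
    }
}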
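
BM25 scores and vector similarities live on different scales, so a hybrid setup needs some way to merge the two result lists. One common option, shown here as a client-side sketch rather than as what this article's systems necessarily used, is reciprocal rank fusion (RRF), which only looks at ranks. The function name is hypothetical, and `k = 60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by ID for deterministic output
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # IDs ordered by vector similarity
keyword_hits = ["b", "d", "a"]  # IDs ordered by BM25 score
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that appear near the top of both lists rise to the front, without any need to normalize raw scores.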

Key Lessons

1. Chunking Strategy Matters More Than You Think

Fixed-size chunking cuts through sentences and sections, so individual chunks lose the context needed to interpret them. Instead:

Use semantic chunking — split at natural boundaries (paragraphs, sections)
Overlap chunks by 10-15% to preserve context across boundaries
Store chunk metadata: source, section path, hierarchy level
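
The three points above can be sketched as a paragraph-based chunker. This is a simplified illustration, not a full semantic chunker: it treats blank lines as the only natural boundary, measures size in words rather than tokens, and uses hypothetical names (`Chunk`, `chunk_paragraphs`) and default values.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str        # originating file or URL
    section_path: str  # e.g. "Guide > Setup" (format is an assumption)
    level: int         # hierarchy level of the section
    index: int         # position of the chunk within the document


def chunk_paragraphs(
    document: str, source: str, section_path: str, level: int,
    max_words: int = 200, overlap_ratio: float = 0.15,
) -> list[Chunk]:
    """Group whole paragraphs into chunks, carrying a word overlap forward."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    current: list[str] = []  # words in the chunk being built
    carried = 0              # how many of those words are overlap from the last chunk

    for paragraph in paragraphs:
        words = paragraph.split()
        has_new_text = len(current) > carried
        if has_new_text and len(current) + len(words) > max_words:
            chunks.append(Chunk(" ".join(current), source, section_path, level, len(chunks)))
            # Carry the tail of the finished chunk forward as overlap
            overlap = max(1, int(len(current) * overlap_ratio))
            current = current[-overlap:]
            carried = len(current)
        current.extend(words)

    if len(current) > carried:
        chunks.append(Chunk(" ".join(current), source, section_path, level, len(chunks)))
    return chunks
```

Paragraphs are never split, so a single paragraph longer than `max_words` becomes one oversized chunk. A production version would likely fall back to sentence-level splitting in that case.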

2. Re-ranking is Non-Negotiable

Vector similarity alone gets you about 70% of the way there. Adding a cross-encoder re-ranker on top can improve relevance by a further 30-40%:

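# Two-stage retrieval
# Stage 1: Elasticsearch hybrid search (fast, broad)
response = es.search(
    index="docs",  # assumed index name
    knn={"field": "embedding", "query_vector": query_embedding, "k": 50, "num_candidates": 100},
    query={"match": {"content": query_text}},  # BM25 side; "content" is an assumed field name
)
candidates = response["hits"]["hits"]

# Stage 2: Cross-encoder re-ranking (slow, precise)
# cross_encoder stands in for whatever re-ranker you use; rerank() is a placeholder method
ranked = cross_encoder.rerank(query_text, candidates, top_k=5)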
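
The second stage can be sketched independently of any particular model. In the example below, `rerank` and `word_overlap` are hypothetical names, and `word_overlap` is a toy scorer used only so the example runs without downloading a model. In practice `score_pair` would wrap a real cross-encoder, for example the `CrossEncoder.predict` method from sentence-transformers.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_k by descending score."""
    scored = [(text, score_pair(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def word_overlap(query: str, text: str) -> float:
    """Toy stand-in scorer: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / max(len(query_words), 1)


top = rerank(
    "vector search",
    ["keyword search only", "vector search in elasticsearch", "unrelated"],
    word_overlap,
    top_k=2,
)
```

Because the scorer runs once per candidate, re-ranking cost grows linearly with the size of the first-stage result set, which is why stage 1 is kept broad but bounded.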

3. Handle Hallucination Proactively

Don't wait for the LLM to hallucinate. Implement:

Confidence scoring on retrieved context
Fallback responses when retrieval quality is low
Explicit "I don't know" triggers based on retrieval scores
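
A minimal sketch of these three safeguards follows. The hit format (`{"text": ..., "score": ...}`), the `generate` callable, and the threshold values are all assumptions for illustration. Real thresholds depend on the scoring function and should be calibrated against labeled queries.

```python
from typing import Callable

FALLBACK_ANSWER = "I don't know based on the available documents."


def select_context(hits: list[dict], min_score: float = 0.6, min_hits: int = 1) -> list[dict]:
    """Keep hits at or above min_score; return [] when too few survive."""
    confident = [hit for hit in hits if hit["score"] >= min_score]
    return confident if len(confident) >= min_hits else []


def answer_or_fallback(
    question: str,
    hits: list[dict],
    generate: Callable[[str, list[dict]], str],
) -> str:
    """Call the LLM only when retrieval quality clears the threshold."""
    context = select_context(hits)
    if not context:
        return FALLBACK_ANSWER  # skip the LLM call entirely
    return generate(question, context)
```

Returning the fallback before calling the model saves tokens and removes the opportunity for the model to improvise an answer from weak context.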

Production Monitoring

Track these metrics in production:

Retrieval precision — does the top result contain relevant info?
Answer relevance — does the response match the query intent?
Latency — end-to-end from query to response
Cost per query — token usage including retrieval overhead
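
Three of these metrics can be computed directly from per-query logs, as in the sketch below. The record fields, function names, and the per-token price are hypothetical placeholders. Answer relevance is left out because it usually needs human or model-based judgments rather than a formula.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class QueryRecord:
    retrieved_ids: list[str]  # document IDs in ranked order
    relevant_ids: set[str]    # IDs judged relevant (from labels or feedback)
    latency_ms: float         # end-to-end, query to response
    prompt_tokens: int        # includes retrieved context
    completion_tokens: int


def precision_at_k(record: QueryRecord, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = record.retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(doc_id in record.relevant_ids for doc_id in top) / len(top)


def summarize(records: list[QueryRecord], k: int = 5, usd_per_1k_tokens: float = 0.002) -> dict:
    """Aggregate metrics over a non-empty batch; the price is a placeholder."""
    tokens = [r.prompt_tokens + r.completion_tokens for r in records]
    return {
        "precision_at_k": mean(precision_at_k(r, k) for r in records),
        "mean_latency_ms": mean(r.latency_ms for r in records),
        "cost_per_query_usd": mean(tokens) / 1000 * usd_per_1k_tokens,
    }
```

Means hide tail behavior, so latency in particular is usually worth tracking at a high percentile as well.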

Conclusion

RAG is not a silver bullet, but with the right architecture — hybrid search, re-ranking, and proper monitoring — it can deliver production-quality results. The key is treating it as a system, not just a prompt engineering exercise.