RAG Glossary: Key Terms and Concepts
A reference guide to the terminology used in Retrieval-Augmented Generation systems. Bookmark this for quick lookups when navigating RAG documentation and discussions.
Core Concepts
RAG (Retrieval-Augmented Generation)
An architecture that combines information retrieval with language model generation. Instead of relying solely on a model's training data, RAG retrieves relevant documents at query time and uses them to ground responses.
Vector Embedding
A numerical representation of text (or other content) in a high-dimensional space where semantically similar items are close together. Embeddings enable semantic search—finding content by meaning rather than exact keyword matches.
Semantic Search
Search based on meaning rather than keywords. Uses vector embeddings to find documents that are conceptually related to a query, even if they don't share exact terms.
Knowledge Base
The collection of documents, data, and information that a RAG system can retrieve from. May include documents, FAQs, product catalogs, internal wikis, and other content sources.
Document Processing
Chunking
The process of splitting documents into smaller pieces for embedding and retrieval. Chunk size and strategy significantly impact retrieval quality.
Chunk Overlap
Including some content from adjacent chunks to preserve context at boundaries. Typically 10-20% overlap helps maintain coherence.
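Chunking with overlap can be sketched in a few lines. This is a minimal character-based splitter (the sizes are illustrative; production splitters usually work on tokens or semantic boundaries):

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size character chunks, with `overlap`
    characters repeated from the end of each chunk into the next."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end
    return chunks
```

With `chunk_size=4, overlap=1`, the string `"abcdefghij"` splits into `["abcd", "defg", "ghij"]`: each chunk repeats the final character of the previous one, preserving context at the boundary.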
Document Loader
A component that ingests documents from various sources (PDFs, web pages, databases) and prepares them for processing.
Text Splitter
The algorithm or method used to divide documents into chunks. Options include character-based, token-based, semantic, and recursive splitting.
Metadata
Additional information attached to document chunks (source, date, author, category) that can be used for filtering during retrieval.
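Metadata filtering can be sketched with plain dictionaries (the chunk structure and field names here are illustrative, not any specific library's schema):

```python
def filter_chunks(chunks, **criteria):
    """Keep only chunks whose metadata matches every given criterion."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in criteria.items())
    ]

chunks = [
    {"text": "Q3 revenue grew 12%.", "metadata": {"source": "report.pdf", "year": 2024}},
    {"text": "Install via pip.",     "metadata": {"source": "docs.md",    "year": 2023}},
]

recent = filter_chunks(chunks, year=2024)  # only the 2024 report chunk
```

In a real vector database this filtering happens inside the index (often before or alongside similarity search), but the logic is the same: narrow the candidate set using structured fields before ranking by vector similarity.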
Embeddings & Storage
Embedding Model
A neural network that converts text into vector embeddings. Popular options include OpenAI's text-embedding-ada-002, Cohere's embed models, and open-source alternatives like BAAI/bge.
Vector Database
A database optimized for storing, indexing, and searching vector embeddings. Examples: Pinecone, Weaviate, Chroma, Qdrant, Milvus, pgvector.
Vector Index
A data structure that enables efficient similarity search over vector embeddings. Common types include HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index).
Dimension
The number of elements in a vector embedding. Common dimensions range from 384 to 1536. Higher dimensions can capture more nuance but require more storage and computation.
Retrieval
Similarity Search
Finding vectors that are mathematically close to a query vector. Common metrics include cosine similarity, Euclidean distance, and dot product.
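All three metrics reduce to simple arithmetic over the vector components; a plain-Python sketch over toy vectors:

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product normalized by vector lengths; 1.0 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note the difference in orientation: for cosine similarity and dot product, higher is more similar; for Euclidean distance, lower is. When embeddings are normalized to unit length, cosine similarity and dot product give identical rankings.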
Top-K Retrieval
Returning the K most similar documents to a query. K is typically 3-10, balancing relevance with context window limits.
Dense Retrieval
Retrieval using vector embeddings (dense vectors). Captures semantic meaning but may miss exact keyword matches.
Sparse Retrieval
Traditional keyword-based retrieval using sparse (mostly-zero) term vectors, typically scored with functions like BM25. Good at exact matching but misses semantic relationships.
Hybrid Search
Combining dense and sparse retrieval to leverage the strengths of both. Often implemented with weighted combination of scores.
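One common implementation of the weighted combination is min-max normalization of each score set followed by a weighted sum; a sketch (the `alpha` weight and score values are illustrative):

```python
def hybrid_scores(dense, sparse, alpha=0.5):
    """Blend dense and sparse scores per document.

    `dense` and `sparse` map doc_id -> raw score on incompatible scales;
    min-max normalization puts both on [0, 1] before the weighted sum.
    `alpha` weights the dense side; 1 - alpha weights the sparse side.
    """
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
        return {d: (s - lo) / span for d, s in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    return {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }

dense = {"a": 0.9, "b": 0.5, "c": 0.1}   # e.g. cosine similarities
sparse = {"a": 2.0, "b": 8.0, "c": 1.0}  # e.g. BM25 scores
blended = hybrid_scores(dense, sparse, alpha=0.5)
```

Normalization matters because cosine similarities and BM25 scores live on entirely different scales; summing them raw would let one retriever dominate.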
Reranking
A second-stage process that uses a more sophisticated model to reorder initial retrieval results for better relevance.
Cross-Encoder
A model used for reranking that processes the query and document together to produce a relevance score. More accurate than bi-encoders but slower.
Bi-Encoder
A model architecture where query and document are encoded separately, enabling fast retrieval through pre-computed document embeddings.
Generation
Context Window
The maximum amount of text (measured in tokens) that a language model can process at once. Retrieved documents must fit within this limit along with the query and response.
Prompt Template
A structured template that combines the user's query with retrieved context and instructions for the language model.
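A minimal template sketch (the instruction wording is illustrative; real systems tune this heavily):

```python
RAG_PROMPT = """Answer the question using only the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, retrieved_chunks):
    """Join retrieved chunks and slot them into the template."""
    context = "\n\n".join(retrieved_chunks)
    return RAG_PROMPT.format(context=context, question=question)

prompt = build_prompt(
    "What is RAG?",
    ["RAG combines retrieval with language model generation."],
)
```

The instruction to answer "using only the context" is what connects the template to grounding: it tells the model to prefer the retrieved documents over its parametric knowledge.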
Grounding
Ensuring that generated responses are based on and supported by the retrieved context, rather than the model's general knowledge.
Citation
Referencing the specific source document(s) that support claims in a generated response. Improves transparency and trustworthiness.
Hallucination
When a language model generates content that is plausible-sounding but not factually accurate or not supported by the provided context.
Advanced Concepts
Query Expansion
Techniques to reformulate or expand user queries to improve retrieval. May include adding synonyms, generating sub-questions, or using LLMs to create better search queries.
HyDE (Hypothetical Document Embeddings)
A technique where the LLM first generates a hypothetical answer, which is then embedded and used for retrieval. Can improve retrieval for certain query types.
Multi-Query Retrieval
Generating multiple variants of a query and combining retrieval results. Increases recall by capturing different phrasings.
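One common way to combine the per-variant result lists is reciprocal rank fusion (RRF); a sketch using the conventional `k = 60` constant (doc IDs are toy data):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each appearance of a document at position `rank` (1-based) adds
    1 / (k + rank) to its score; documents found by several query
    variants accumulate score and rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query variants returned overlapping rankings; "b" appears
# near the top of all three, so it wins the fused ranking.
fused = reciprocal_rank_fusion([["a", "b"], ["b", "c"], ["b", "a"]])
```

RRF needs no score normalization because it uses only rank positions, which makes it a robust default for merging lists from heterogeneous retrievers.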
Parent-Child Retrieval
Retrieving smaller chunks for precision but returning larger parent sections for context. Balances specificity with completeness.
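The expansion step can be sketched as a lookup from child chunk to parent section, deduplicated in retrieval order (the chunk and parent IDs are toy data):

```python
def expand_to_parents(hit_chunk_ids, chunk_to_parent, parents):
    """Map retrieved child-chunk IDs to their parent sections.

    Parents are returned in the order their first child was retrieved,
    and each parent appears at most once.
    """
    seen, out = set(), []
    for chunk_id in hit_chunk_ids:
        parent_id = chunk_to_parent[chunk_id]
        if parent_id not in seen:
            seen.add(parent_id)
            out.append(parents[parent_id])
    return out

chunk_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2"}
parents = {"p1": "Section 1 full text...", "p2": "Section 2 full text..."}

# Two hits from the same section collapse into one parent.
context = expand_to_parents(["c2", "c1", "c3"], chunk_to_parent, parents)
```

Retrieval thus matches on small, precise chunks, while the model sees whole sections, which is the precision/completeness trade-off the term describes.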
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)
A technique that builds hierarchical summaries of documents for multi-level retrieval.
Graph RAG
Combining knowledge graphs with RAG to capture entity relationships and enable more structured reasoning.
Agentic RAG
RAG systems where an AI agent decides when and how to retrieve, potentially making multiple retrieval calls and reasoning over results.
MAG (Memory-Augmented Generation)
Systems that maintain persistent memory across conversations, enabling long-term context and personalization.
Evaluation Metrics
Retrieval Precision
The proportion of retrieved documents that are actually relevant to the query.
Retrieval Recall
The proportion of all relevant documents that were successfully retrieved.
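Both metrics reduce to set arithmetic over retrieved and relevant document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 2 of them relevant; 3 relevant docs exist in total.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
# precision = 2/4, recall = 2/3
```

The two pull in opposite directions: retrieving more documents (larger K) tends to raise recall but lower precision.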
MRR (Mean Reciprocal Rank)
Measures how high the first relevant document appears in the ranked results. Higher is better.
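A sketch of MRR over a small batch of queries (doc IDs are toy data):

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant result per query.

    A query whose results contain no relevant document contributes 0.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Query 1: first relevant doc at rank 2 -> 1/2.
# Query 2: first relevant doc at rank 1 -> 1.
# MRR = (0.5 + 1.0) / 2 = 0.75
mrr = mean_reciprocal_rank([["a", "b"], ["x", "y"]], [{"b"}, {"x"}])
```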
NDCG (Normalized Discounted Cumulative Gain)
A metric that accounts for both relevance and ranking position of retrieved documents.
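A sketch computing NDCG from graded relevance scores listed in ranked order (the standard log-discounted formulation):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each position's relevance is
    discounted by log2(rank + 1), so early positions count more."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    """DCG normalized by the ideal (best-possible) ordering's DCG."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0
```

A ranking that already lists the most relevant documents first scores 1.0; burying a highly relevant document lower in the list drops the score below 1.0.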
Faithfulness
Whether generated answers are supported by the retrieved context (vs. using general knowledge or hallucinating).
Answer Relevance
Whether generated answers actually address the user's question.
Context Relevance
Whether the retrieved context is appropriate for answering the query.
Infrastructure Terms
Latency
The time between a query being submitted and a response being returned. RAG adds retrieval latency to generation latency.
Throughput
The number of queries a RAG system can handle per unit of time.
Cold Start
Delay when a system needs to initialize or load models/indexes before processing queries.
Caching
Storing frequently retrieved documents or embeddings to reduce latency for common queries.
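A minimal memoizing cache around an embedding call; `embed_fn` here is a stand-in for a real embedding-model client:

```python
class EmbeddingCache:
    """Memoize embedding calls keyed on the input text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}
        self.hits = 0

    def get(self, text):
        if text in self.store:
            self.hits += 1          # served from cache, no model call
        else:
            self.store[text] = self.embed_fn(text)
        return self.store[text]

# Fake embedder that records each (expensive) call it receives.
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text))]

cache = EmbeddingCache(fake_embed)
cache.get("hello")  # first call hits the model
cache.get("hello")  # second call is served from the cache
```

Real deployments often use an external store (e.g. Redis) and also cache final answers for frequently repeated questions, but the principle is the same: pay the embedding or retrieval cost once per distinct input.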
This glossary covers the most common RAG terminology. As the field evolves, new concepts and techniques continue to emerge.
Building a RAG system and need guidance? Get in touch to discuss your implementation.