RAG (Retrieval-Augmented Generation)
Introduction
There are three main modes for AI systems:
- Retrieval: The system retrieves relevant information from a database or knowledge base.
- Generation: The system generates new content based on the input it receives.
- Action: The system takes actions based on the input it receives, for example through MCP or agent frameworks such as Camel, where the agent interacts with the environment and acts on the input it receives.
RAG is well established at the time of writing, and it mainly combines the first two modes.
How RAG works:
- Question: The user asks a question or provides input to the system.
- Retrieval: The system retrieves relevant information from a database or knowledge base; chunking and embedding are used to find the most relevant information. The knowledge is often stored in a vector database, such as Pinecone or Weaviate, and the embedding is often produced by an embedding model such as OpenAI’s text-embedding-ada-002.
- Generation: The system generates a response based on the retrieved information and the input it received.
Components of RAG
- Embedding: The process of converting text into a numerical representation that can be used by the system.
- Vector Database: A database that stores the embeddings and allows for efficient retrieval of relevant information.
- Retriever: The component that retrieves relevant information from the vector database based on the input it receives.
- Generator: The component that generates a response based on the retrieved information and the input it received.
The pipeline of Indexing
- Load: Load data in different formats into the system, such as TXT, PDF, CSV, etc.
- Chunk: The data is chunked into smaller pieces to make it easier to process and retrieve relevant information.
- Embed: The chunked data is embedded into a numerical representation that can be used by the system.
- Store: The embedded data is stored in a vector database for efficient retrieval.
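As a concrete illustration, here is a minimal indexing sketch using LangChain and a local FAISS index. The folder name, model, and parameters are assumptions, not a prescribed setup, and the class names follow the classic LangChain API, which may differ across versions.

```python
# Minimal indexing pipeline sketch: Load -> Chunk -> Embed -> Store.
# Assumes a local "docs/" folder and an OpenAI API key.
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

docs = DirectoryLoader("docs/").load()                           # Load
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(docs)                          # Chunk
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())   # Embed
vectorstore.save_local("rag_index")                              # Store
```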
The pipeline of Retrieval and generation
- Query: The user asks a question or provides input to the system. This step can include clarifying the question, splitting it into smaller sub-questions, and rephrasing it to make retrieval easier.
- Retrieve: Select and rank the relevant information from the vector database based on the query. The retriever can be a simple keyword search or a more complex semantic search.
The pipeline of Generation
The retrieved information is provided to the generator together with the input question, and the generative model then produces the response. The generator can be a simple template-based system or a more complex LLM-based system.
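Putting the two steps together, a minimal retrieval-plus-generation sketch might look like the following. It assumes the FAISS index built in the indexing example above and an OpenAI chat model; the question and prompt are purely illustrative, and API names may differ across LangChain versions.

```python
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI

vectorstore = FAISS.load_local("rag_index", OpenAIEmbeddings())
question = "What does the warranty cover?"

docs = vectorstore.similarity_search(question, k=4)         # Retrieve top-4 chunks
context = "\n\n".join(d.page_content for d in docs)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
answer = ChatOpenAI(temperature=0).predict(prompt)          # Generate
print(answer)
```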
How to optimize the RAG system
Format of the data
Data can come from different sources and in different formats, like TXT, PDF, CSV, etc. The format of the data can affect the performance of the RAG system. For example, structured formats like CSV may be easier to process and retrieve relevant information from, while unstructured formats like PDF may require more processing to extract relevant information. It is important to consider the format of the data when designing and optimizing the RAG system.
From my experience, using Markdown as the data format can be very helpful for LLMs, as it concentrates more on the content and less on the formatting. Markdown also captures the structure of the data, like headings and lists, which is useful for retrieval and generation, and it improves the readability of the data for both the retriever and the generator.
Chunking
The chunking process is crucial for the performance of the RAG system. The chunk size affects both retrieval performance and generation quality: if the chunks are too small, they may not capture enough context for retrieval and generation; if they are too large, they may contain too much irrelevant information, which hurts retrieval. The optimal chunk size depends on the specific use case and the type of data being processed, and is often determined through experimentation and tuning for the requirements of the application. Chunking also affects the recall and precision of retrieval, as well as the response speed and efficiency of the system.
Different chunking strategies are mainly affected by three factors: chunk size, overlap, and splitting logic. The chunking strategies can be categorized into three main types: Basic Strategies, Splitting Strategies, and Combination Strategies.
```mermaid
graph TD
    A[Chunking Strategy]
    A1[Size]
    A2[Overlap]
    A3[Splitting Logic]
    A1 --> A
    A2 --> A
    A3 --> A
    A --> B1[Basic Strategies]
    A --> B2[Splitting Strategies]
    A --> B3[Combination Strategies]
    B1 --> C1[Fixed Size Chunking]
    B1 --> C2[Overlap Chunking]
    B2 --> C3[Recursive Chunking]
    B2 --> C4[Document-Specific Chunking]
    B2 --> C5[Semantic Chunking]
    B3 --> C6[Hybrid Chunking]
```
- Basic Strategies: These strategies involve simple chunking methods based on fixed size or overlap.
- Fixed Size Chunking: The data is divided into chunks of a fixed size, such as 512 tokens. This method is straightforward but may not capture the context effectively.
- Overlap Chunking: The data is divided into chunks with a certain overlap between them, such as 64 tokens. This method can help capture more context but may lead to redundancy.
For example, LangChain's RecursiveCharacterTextSplitter can produce overlapping chunks:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# `text` is the raw document string to split; sizes here are measured in characters.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_text(text)
```
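For a plain fixed-size split without overlap, CharacterTextSplitter can be used instead; the separator and sizes below are illustrative.

```python
from langchain.text_splitter import CharacterTextSplitter

fixed_splitter = CharacterTextSplitter(separator=" ", chunk_size=512, chunk_overlap=0)
fixed_chunks = fixed_splitter.split_text(text)  # `text` is the same raw string as above
```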
- Splitting Strategies: These strategies involve more complex chunking methods based on the structure of the data or the semantics of the content.
- Recursive Chunking: The data is recursively divided into smaller chunks until a certain condition is met, such as a maximum chunk size or a specific delimiter. This method can help capture the structure of the data but may require more processing time.
- Document-Specific Chunking: The data is chunked based on the specific structure of the document, such as paragraphs, sections, or sentences. This method can help capture the context effectively but may require custom logic for different types of documents.
- Semantic Chunking: The data is chunked based on the semantic meaning of the content, such as topics or themes. This method can help capture the context effectively but may require more advanced natural language processing techniques, such as using the NLTK library or spaCy.
- Combination Strategies: These strategies involve combining different chunking methods to optimize the performance of the RAG system. For example, a combination of fixed size chunking and overlap chunking can be used to capture more context while avoiding redundancy.
Langchain provides a powerful tool for chunking text data, allowing for flexible and customizable chunking strategies.
```python
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownTextSplitter,
    PythonCodeTextSplitter,
    LatexTextSplitter,
    SpacyTextSplitter,
    NLTKTextSplitter,
)
```
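For example, MarkdownTextSplitter splits along Markdown headings and paragraphs so that chunks follow the document structure; the sample text and sizes below are illustrative.

```python
from langchain.text_splitter import MarkdownTextSplitter

markdown_text = "# Title\n\nIntro paragraph.\n\n## Section 1\n\nDetails about section 1..."
md_splitter = MarkdownTextSplitter(chunk_size=200, chunk_overlap=20)
md_chunks = md_splitter.split_text(markdown_text)
```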
Embedding
Embedding is the process of converting text into a high dimensional numerical representation that can be used by the system. The choice of embedding technique can have a significant impact on the performance of the RAG system.
We can check the MTEB embedding leaderboard to find promising embedding models for our specific use case, but a good benchmark score is not the only factor to consider. The choice of embedding model should also take into account the specific requirements of the application, such as the type of data being processed, the computational resources available, and the desired performance metrics. It is important to experiment with different embedding models and evaluate them on your specific use case to find the best one for your application.
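A simple way to compare candidate models on your own data is to embed a few representative query-passage pairs and inspect the similarity scores. The sketch below assumes the sentence-transformers package; the model names are just examples of models listed on the MTEB leaderboard.

```python
from sentence_transformers import SentenceTransformer, util

query = "How do I reset my password?"
passage = "To reset your password, open Settings and choose 'Reset password'."

# Compare how strongly each candidate model relates the query to the passage.
for model_name in ["all-MiniLM-L6-v2", "BAAI/bge-small-en-v1.5"]:
    model = SentenceTransformer(model_name)
    q_vec, p_vec = model.encode([query, passage])
    print(model_name, float(util.cos_sim(q_vec, p_vec)))
```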
RAG Database
There are several NoSQL and vector databases that can be used for a RAG project.
- Key-value databases: These databases store data as key-value pairs, where each key is unique and maps to a specific value. Examples include Redis and DynamoDB.
- Document databases: These databases store data in a document format, such as JSON or BSON. Examples include MongoDB and CouchDB.
- Graph databases: These databases store data in a graph format, where nodes represent entities and edges represent relationships between entities. Examples include Neo4j and Amazon Neptune.
- Vector databases: These databases are specifically designed to store and retrieve high-dimensional vectors, which are commonly used in RAG systems for embedding representations. Examples include Pinecone, Weaviate, and Faiss.
The core of vector databases lies in their efficient indexing and search mechanisms. To optimize query performance, they employ various algorithms such as hashing, quantization, and graph-based methods. These algorithms construct index structures like Hierarchical Navigable Small World (HNSW) graphs, Product Quantization (PQ), and Locality-Sensitive Hashing (LSH), significantly improving query speed. This search process does not pursue absolute precision but instead trades off between speed and accuracy through Approximate Nearest Neighbor (ANN) algorithms, enabling rapid response.
The index structure of vector databases can be understood as a preprocessing step, similar to creating an index for books in a library to facilitate quick location of desired content. HNSW graphs rapidly narrow the search scope by connecting similar vectors across multi-layered structures. PQ reduces memory footprint and accelerates retrieval by compressing high-dimensional vectors, while LSH facilitates rapid positioning by clustering similar vectors together through hash functions.
By allowing a certain degree of error, ANN algorithms significantly improve search speed while still identifying vectors that are highly similar to the query. This trade-off is particularly important for applications that require fast, real-time responses.
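As a small illustration of ANN search, the sketch below builds an HNSW index with the faiss library over random vectors; the dimension and parameters are arbitrary placeholders.

```python
import numpy as np
import faiss

dim = 384                                   # embedding dimension (illustrative)
vectors = np.random.rand(10_000, dim).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)        # HNSW graph with 32 links per node
index.add(vectors)                          # build the index

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)     # approximate top-5 nearest neighbors
print(ids[0], distances[0])
```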
Hybrid Search and Reranking
Hybrid search combines vector search (semantic similarity) with keyword search (lexical matching) to retrieve more relevant and comprehensive results. If the query contains specific keywords, such as a person’s name, a brand name, or rare words, or if the query itself is very short, keyword search can overcome the limitations of vector search, which relies more on the meaning of a phrase.
Reranking is the process of reordering the retrieved results based on their relevance to the query. After the initial retrieval, a reranking model can be applied to further refine the results and improve the relevance of the final output. Reranking can be done using various techniques, such as using a more complex model to evaluate the relevance of each retrieved result, using additional features such as document length or keyword frequency, or simply taking a weighted sum of the similarity score and the keyword-matching score, as in the sketch below.
```python
def rerank_results(retrieved_results, query):
    """Rerank retrieved results by a weighted sum of semantic and keyword scores."""
    reranked_results = []
    for result in retrieved_results:
        # calculate_similarity / calculate_keyword_matching are assumed scoring helpers.
        similarity_score = calculate_similarity(result, query)
        keyword_matching_score = calculate_keyword_matching(result, query)
        final_score = 0.7 * similarity_score + 0.3 * keyword_matching_score
        reranked_results.append((result, final_score))
    reranked_results.sort(key=lambda x: x[1], reverse=True)
    return [result for result, score in reranked_results]
```
Azure's research blog shows that hybrid search and reranking can significantly improve the performance of a RAG system: Azure AI Search: Outperforming vector search with hybrid retrieval and reranking.
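As a sketch of what hybrid search can look like in code, LangChain offers a BM25 keyword retriever and an EnsembleRetriever that fuses keyword and vector rankings. The texts and weights below are assumptions, and the vector store is the FAISS index from the earlier examples.

```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

texts = [
    "Refund policy for the Acme X200 camera.",
    "How to reset your password in Settings.",
    "Warranty terms for hardware purchases.",
]
keyword_retriever = BM25Retriever.from_texts(texts)   # lexical matching (needs rank_bm25)
vector_retriever = vectorstore.as_retriever()         # semantic matching over the FAISS index

hybrid = EnsembleRetriever(
    retrievers=[keyword_retriever, vector_retriever],
    weights=[0.4, 0.6],
)
results = hybrid.get_relevant_documents("Acme X200 refund")
```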

There are several retrieval methods that can be used in the RAG system, including:
| Search Method | Description |
|---|---|
| Vector Search | Converts documents into vector representations and performs semantic similarity search between vectors to find content that is closest in meaning to the query. |
| Keyword Search | Relies on exact keyword matching in text. Suitable for scenarios requiring precise searches of specific terms, phrases, or identifiers. |
| Multi-Query Search | Generates multiple questions related to the original query to expand the search scope. Suitable when the query is unclear or ambiguous. |
| Contextual Compression Search | First retrieves a large number of relevant documents, then compresses their content to retain only core information, making it easier for the model to process and understand. |
| Parent Document Search | Retrieves entire documents instead of just document fragments. Suitable when full document context is required to ensure completeness of information. |
| Multi-Vector Search | Creates multiple vector representations for each document to capture different aspects or features, enabling multi-dimensional information retrieval. |
| Self-Query Search | Understands and reformulates complex queries to improve search precision. Suitable for handling complex queries and better aligning results with user intent. |
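As an example, Multi-Query Search can be sketched with LangChain's MultiQueryRetriever, which asks an LLM to generate several variations of the user's query and merges the retrieved results. The setup below assumes the FAISS index and models from the earlier examples.

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chat_models import ChatOpenAI

multi_query = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),   # base vector-search retriever
    llm=ChatOpenAI(temperature=0),          # LLM that rephrases the query
)
docs = multi_query.get_relevant_documents("Why did sales drop last quarter?")
```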
Reranking methods include:
- Reciprocal Rank Fusion (RRF); a small implementation sketch is given after this list:
\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}
- R: The set of ranked result lists from different retrieval methods.
- r(d): The position of document d in ranking r.
- k: The smoothing parameter, typically set to 60. If d is not present in a ranked list r, then r(d) is considered to be infinity, and the contribution of that ranked list to the RRF score for document d is zero. The RRF method effectively combines the strengths of multiple retrieval methods, giving higher scores to documents that consistently rank well across different methods, while also mitigating the impact of outliers or noise in any single ranked list.
- Weighted Reciprocal Rank Fusion (WRRF): a weighted version of RRF that assigns different weights to the ranked lists based on their importance or reliability. The formula for WRRF is as follows:
\text{WRRF}(d) = \sum_{r \in R} w_r \cdot \frac{1}{k + r(d)}
- w_r: The weight assigned to ranked list r.
- Score-Based Normalization Variant:
\text{RRF}_{\text{score}}(d) = \sum_{r \in R} \frac{s_r(d)}{\sum_{d' \in D} s_r(d')}
- s_r(d): The relevance score assigned to document d by retrieval method r, normalized by the total score over all candidate documents D.
- LLM Based: We can also use an LLM to do the reranking, by providing the retrieved results and the query to the LLM and asking it to rank the results based on their relevance to the query. This method can be more effective than traditional reranking methods, as it takes into account the context and semantics of the retrieved results, but it may also be more computationally expensive.
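To make the rank-fusion formulas concrete, here is a small, self-contained sketch of RRF and its weighted variant over two illustrative ranked lists of document ids.

```python
def rrf_fuse(ranked_lists, weights=None, k=60):
    """Fuse several ranked lists of document ids with (weighted) Reciprocal Rank Fusion."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for ranking, w in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents missing from a list contribute nothing (r(d) -> infinity).
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["doc3", "doc1", "doc7"]
keyword_ranking = ["doc1", "doc9", "doc3"]
print(rrf_fuse([vector_ranking, keyword_ranking]))                      # plain RRF
print(rrf_fuse([vector_ranking, keyword_ranking], weights=[0.7, 0.3]))  # WRRF
```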