RAG (Retrieval-Augmented Generation)

Introduction

There are three main modes for AI systems:

  1. Retrieval: The system retrieves relevant information from a database or knowledge base.
  2. Generation: The system generates new content based on the input it receives.
  3. Action: The system takes actions based on the input it receives, as in agent frameworks (e.g., MCP-based tooling or CAMEL), where the agent can interact with the environment and act on the input it receives.

RAG is well established by the time I write this article, and it mainly covers the first two modes.

How RAG works

  1. Question: The user asks a question or provides input to the system.
  2. Retrieval: The system retrieves relevant information from a database or knowledge base; chunking and embedding are used to find the most relevant information. The knowledge is often stored in a vector database, like Pinecone, Weaviate, etc. The embedding is often produced by an embedding model, like OpenAI’s text-embedding-ada-002.
  3. Generation: The system generates a response based on the retrieved information and the input it received.

Components of RAG

  1. Embedding: The process of converting text into a numerical representation that can be used by the system.
  2. Vector Database: A database that stores the embeddings and allows for efficient retrieval of relevant information.
  3. Retriever: The component that retrieves relevant information from the vector database based on the input it receives.
  4. Generator: The component that generates a response based on the retrieved information and the input it received.

The pipeline of Indexing

  1. Load: Load data in different formats into the system, such as TXT, PDF, CSV, etc.
  2. Chunk: The data is chunked into smaller pieces to make it easier to process and retrieve relevant information.
  3. Embed: The chunked data is embedded into a numerical representation that can be used by the system.
  4. Store: The embedded data is stored in a vector database for efficient retrieval.
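The four indexing steps can be sketched end to end. This is a toy illustration: `embed` here is a deterministic hash-based stand-in for a real embedding model, and the in-memory list stands in for a vector database.

```python
import hashlib

def load(documents):
    # Stand-in for real loaders (TXT, PDF, CSV, ...): here the "documents"
    # are already plain strings.
    return list(documents)

def chunk(text, size=100):
    # Naive fixed-size chunking by character count.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(piece):
    # Toy deterministic "embedding": 8 numbers derived from a hash.
    # A real system would call an embedding model here instead.
    digest = hashlib.sha256(piece.encode()).digest()
    return [b / 255 for b in digest[:8]]

def index(documents, chunk_size=100):
    # Store (vector, chunk) pairs; a real system would write to a vector DB.
    store = []
    for doc in load(documents):
        for piece in chunk(doc, chunk_size):
            store.append((embed(piece), piece))
    return store

store = index(["RAG combines retrieval with generation." * 5], chunk_size=50)
```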

The pipeline of Retrieval and generation

  1. Query: The user asks a question or provides input to the system. This step can include clarifying the question, splitting it into smaller pieces, and rephrasing it to make retrieval easier.
  2. Retrieve: Select and rank the relevant information from the vector database based on the input it receives. The retriever can be a simple keyword search or a more complex semantic search.

The pipeline of Generation

The retrieved information is provided to the generator together with the input question, and the GenAI model then generates the response. The generator can be a simple template-based system or a more complex LLM-based system.
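A minimal sketch of this hand-off, with an illustrative prompt template of my own wording:

```python
def build_prompt(question, retrieved_chunks):
    # Assemble the retrieved context and the user question into one prompt
    # for the generator (an LLM or a template-based system).
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("What is RAG?", ["RAG retrieves documents.", "RAG generates answers."])
```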

How to optimize the RAG system

Format of the data

Data can come from different sources and in different formats, like TXT, PDF, CSV, etc. The format of the data can affect the performance of the RAG system. For example, if the data is in a structured format, like CSV, it may be easier to process and retrieve relevant information. If the data is in an unstructured format, like PDF, it may require more processing to extract relevant information. It is important to consider the format of the data when designing and optimizing the RAG system.

From my experience, using Markdown as the data format can be very helpful for LLMs, as it concentrates more on the content and less on the presentation. Markdown also captures the structure of the data, like headings, lists, etc., which is useful for retrieval and generation, and it improves the readability of the data, benefiting both the retriever and the generator.

Chunking

The chunking process is crucial for the performance of the RAG system. The chunk size affects both retrieval performance and generation quality. If the chunk size is too small, it may not capture enough context for retrieval and generation. If the chunk size is too large, it may contain too much irrelevant information, which hurts retrieval performance. The optimal chunk size depends on the specific use case and the type of data being processed, and is often determined through experimentation and tuned to the requirements of the application. The chunking process affects the recall and precision of retrieval, as well as the response speed and efficiency of the system.

Different chunking strategies are shaped mainly by three factors: Chunk Size, Overlap, and Splitting Logic. The chunking strategies can be categorized into three main types: Basic Strategies, Splitting Strategies, and Combination Strategies.

graph TD

    A[Chunking Strategy]

    A1[Size]
    A2[Overlap]
    A3[Splitting Logic]

    A1 --> A
    A2 --> A
    A3 --> A

    A --> B1[Basic Strategies]
    A --> B2[Splitting Strategies]
    A --> B3[Combination Strategies]

    B1 --> C1[Fixed Size Chunking]
    B1 --> C2[Overlap Chunking]

    B2 --> C3[Recursive Chunking]
    B2 --> C4[Document-Specific Chunking]
    B2 --> C5[Semantic Chunking]

    B3 --> C6[Hybrid Chunking]
  1. Basic Strategies: These strategies involve simple chunking methods based on fixed size or overlap.
    • Fixed Size Chunking: The data is divided into chunks of a fixed size, such as 512 tokens. This method is straightforward but may not capture the context effectively.
    • Overlap Chunking: The data is divided into chunks with a certain overlap between them, such as 64 tokens. This method can help capture more context but may lead to redundancy.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Overlap chunking: 500-character chunks with 100 characters of overlap,
# splitting preferentially on paragraph, line, and word boundaries.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_text(text)  # `text` is the document to split
  2. Splitting Strategies: These strategies involve more complex chunking methods based on the structure of the data or the semantics of the content.

    • Recursive Chunking: The data is recursively divided into smaller chunks until a certain condition is met, such as a maximum chunk size or a specific delimiter. This method can help capture the structure of the data but may require more processing time.
    • Document-Specific Chunking: The data is chunked based on the specific structure of the document, such as paragraphs, sections, or sentences. This method can help capture the context effectively but may require custom logic for different types of documents.
    • Semantic Chunking: The data is chunked based on the semantic meaning of the content, such as topics or themes. This method can help capture the context effectively but may require more advanced natural language processing techniques, such as using the NLTK library or spaCy.
  3. Combination Strategies: These strategies involve combining different chunking methods to optimize the performance of the RAG system. For example, a combination of fixed size chunking and overlap chunking can be used to capture more context while avoiding redundancy.
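As an illustration of document-specific chunking, the sketch below splits on paragraph boundaries and greedily merges paragraphs up to a size limit; the limit and merge rule are illustrative choices, not a standard implementation.

```python
def chunk_by_paragraph(text, max_chars=200):
    # Split on blank lines (paragraph boundaries), then greedily merge
    # consecutive paragraphs until the character limit would be exceeded.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```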

LangChain provides powerful tools for chunking text data, allowing for flexible and customizable chunking strategies.

from langchain.text_splitter import (
  CharacterTextSplitter,
  RecursiveCharacterTextSplitter,
  MarkdownTextSplitter,
  PythonCodeTextSplitter,
  LatexTextSplitter,
  SpacyTextSplitter,
  NLTKTextSplitter)

Embedding

Embedding is the process of converting text into a high-dimensional numerical representation that can be used by the system. The choice of embedding technique can have a significant impact on the performance of the RAG system.

We can check the MTEB Embedding leaderboard to find the best embedding model for our specific use case. But a good benchmark score is not the only factor to consider when choosing an embedding model. The choice should also take into account the specific requirements of the application, such as the type of data being processed, the computational resources available, and the desired performance metrics. It is important to experiment with different embedding models and evaluate their performance on your specific use case to find the best one for your application.
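Under the hood, retrieval compares embeddings with a similarity measure, most commonly cosine similarity:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors: 1.0 means the same
    # direction (similar meaning, under the embedding), 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```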

RAG Database

There are several NoSQL and vector databases that can be used for a RAG project.

  1. Key-value databases: These databases store data as key-value pairs, where each key is unique and maps to a specific value. Examples include Redis and DynamoDB.
  2. Document databases: These databases store data in a document format, such as JSON or BSON. Examples include MongoDB and CouchDB.
  3. Graph databases: These databases store data in a graph format, where nodes represent entities and edges represent relationships between entities. Examples include Neo4j and Amazon Neptune.
  4. Vector databases: These databases are specifically designed to store and retrieve high-dimensional vectors, which are commonly used in RAG systems for embedding representations. Examples include Pinecone, Weaviate, and Faiss.

The core of vector databases lies in their efficient indexing and search mechanisms. To optimize query performance, they employ various algorithms such as hashing, quantization, and graph-based methods. These algorithms construct index structures like Hierarchical Navigable Small World (HNSW) graphs, Product Quantization (PQ), and Locality-Sensitive Hashing (LSH), significantly improving query speed. This search process does not pursue absolute precision but instead trades off between speed and accuracy through Approximate Nearest Neighbor (ANN) algorithms, enabling rapid response.

The index structure of vector databases can be understood as a preprocessing step, similar to creating an index for books in a library to facilitate quick location of desired content. HNSW graphs rapidly narrow the search scope by connecting similar vectors across multi-layered structures. PQ reduces memory footprint and accelerates retrieval by compressing high-dimensional vectors, while LSH facilitates rapid positioning by clustering similar vectors together through hash functions.

The search mechanism of vector databases does not pursue exact matching but instead finds the optimal balance between speed and accuracy through ANN algorithms. By allowing a certain degree of error, ANN algorithms significantly improve search speed while still identifying vectors with high similarity to the query. This strategy is particularly crucial for application scenarios requiring real-time, high-precision responses.
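For contrast, exact nearest-neighbor search is a brute-force scan over every vector; ANN index structures like HNSW, PQ, and LSH approximate the same result while avoiding the full scan:

```python
import math

def exact_nearest(query, vectors, k=2):
    # Brute-force scan: compute the distance from the query to every stored
    # vector. This is exact but O(n); ANN indexes such as HNSW, PQ, and LSH
    # approximate the same result with far fewer comparisons.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))[:k]

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.1]]
nearest = exact_nearest([0.0, 0.0], vectors, k=2)  # indices 0 and 2
```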

Hybrid Search and Reranking

Hybrid search combines vector search (semantic similarity) with keyword search (lexical matching) to retrieve more relevant and comprehensive results. If the query contains specific keywords, like a person’s name, a brand name, or rare words, or the query itself is very short, keyword search can overcome the limitations of vector search, which is based more on the meaning of a phrase.

Reranking is the process of reordering the retrieved results based on their relevance to the query. After the initial retrieval, a reranking model can be applied to further refine the results and improve the relevance of the final output. Reranking can be done with various techniques: using a more complex model to evaluate the relevance of each retrieved result, using additional features such as the length of the retrieved document or the number of times a keyword appears in it, or simply taking a weighted sum of the similarity score and the keyword-matching score.

def rerank_results(retrieved_results, query):
    # calculate_similarity and calculate_keyword_matching are assumed helpers
    # that return scores in [0, 1]; the 0.7/0.3 weights are illustrative.
    reranked_results = []
    for result in retrieved_results:
        similarity_score = calculate_similarity(result, query)
        keyword_matching_score = calculate_keyword_matching(result, query)
        # Weighted sum of semantic similarity and keyword matching.
        final_score = 0.7 * similarity_score + 0.3 * keyword_matching_score
        reranked_results.append((result, final_score))
    reranked_results.sort(key=lambda x: x[1], reverse=True)
    return [result for result, score in reranked_results]

An Azure research blog shows that hybrid search and reranking can significantly improve the performance of the RAG system: Azure AI Search: Outperforming vector search with hybrid retrieval and reranking.

Hybrid retrieval with semantic ranking outperforms vector-only search

There are several retrieval methods that can be used in the RAG system, including:

  • Vector Search: Converts documents into vector representations and performs semantic similarity search between vectors to find content that is closest in meaning to the query.
  • Keyword Search: Relies on exact keyword matching in text. Suitable for scenarios requiring precise searches of specific terms, phrases, or identifiers.
  • Multi-Query Search: Generates multiple questions related to the original query to expand the search scope. Suitable when the query is unclear or ambiguous.
  • Contextual Compression Search: First retrieves a large number of relevant documents, then compresses their content to retain only core information, making it easier for the model to process and understand.
  • Parent Document Search: Retrieves entire documents instead of just document fragments. Suitable when full document context is required to ensure completeness of information.
  • Multi-Vector Search: Creates multiple vector representations for each document to capture different aspects or features, enabling multi-dimensional information retrieval.
  • Self-Query Search: Understands and reformulates complex queries to improve search precision. Suitable for handling complex queries and better aligning results with user intent.

Reranking methods include:

  1. Reciprocal Rank Fusion (RRF):
\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}
  • R: The set of ranked result lists from different retrieval methods.
  • r(d): The position of document d in ranking r.
  • k: A smoothing parameter, typically set to 60. If d is not present in a ranked list r, then r(d) is considered infinite, and that list contributes zero to the RRF score for d. RRF effectively combines the strengths of multiple retrieval methods, giving higher scores to documents that consistently rank well across methods while mitigating the impact of outliers or noise in any single ranked list.
  2. Weighted Reciprocal Rank Fusion (WRRF): A weighted version of RRF that assigns a different weight w_r to each ranked list based on its importance or reliability. The formula is:
\text{WRRF}(d) = \sum_{r \in R} w_r \cdot \frac{1}{k + r(d)}
  3. Score-Based Normalization Variant:
\text{RRF}_{\text{score}}(d) = \sum_{r \in R} \frac{s_r(d)}{\sum_{d' \in D} s_r(d')}
  4. LLM-Based: We can also use an LLM to do the reranking, by providing the retrieved results and the query to the LLM and asking it to rank the results by relevance. This can be more effective than traditional reranking methods, since it takes the context and semantics of the retrieved results into account, but it is also more computationally expensive.
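A minimal implementation of plain RRF, following the formula above:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # ranked_lists: each a list of document ids, best first.
    # RRF(d) = sum over lists of 1 / (k + rank(d)), with ranks starting at 1;
    # documents missing from a list contribute nothing for that list.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks near the top of both lists, so it wins the fused ranking.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c"]])
```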

Prompt Engineering

SuperCLUE is a benchmark for evaluating the performance of language models on a variety of natural language understanding tasks. It includes a set of tasks that require different types of reasoning and understanding, such as commonsense reasoning, reading comprehension, and natural language inference. The SuperCLUE benchmark can be used to evaluate the performance of the RAG system and to identify areas for improvement in the prompt engineering process.

Prompt contains four parts normally:

  1. Instruction: The instruction provides the model with a clear and concise description of the task it needs to perform. It should be specific and unambiguous to ensure that the model understands what is expected of it.
  2. Context: The context provides the model with relevant information that can help it perform the task. This can include background information, examples, or any other relevant data that can help the model understand the task better.
  3. Input Data: The input data is the specific information that the model needs to process in order to perform the task. This can include text, images, or any other type of data that is relevant to the task.
  4. Output Format: The output format specifies how the model should structure its response. This can include specific formatting requirements, such as JSON or XML, or it can simply specify the type of response that is expected, such as a summary or a list of key points.
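The four parts can be assembled with a simple template; the wording and field labels below are illustrative, not a standard:

```python
def compose_prompt(instruction, context, input_data, output_format):
    # Concatenate the four standard prompt parts with clear section labels.
    return (
        f"Instruction: {instruction}\n\n"
        f"Context: {context}\n\n"
        f"Input: {input_data}\n\n"
        f"Output format: {output_format}"
    )

prompt = compose_prompt(
    "Summarize the passage in one sentence.",
    "The passage comes from an internal wiki.",
    "RAG combines retrieval with generation.",
    "Plain text, one sentence.",
)
```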

Some different ways to optimize the prompt engineering process include:

  1. Few-shot Learning: Providing the model with a few examples of the task can help it understand the task better and improve its performance. This can be done by including a few examples in the instruction or by providing a separate section for examples in the prompt.
  2. Chain-of-Thought Prompting: This technique involves breaking down the task into smaller steps and providing the model with a clear chain of thought to follow. This can help the model understand the reasoning process and improve its performance on complex tasks.
  3. Include failure history: If we run several retries, we can include the previous failed attempts in the prompt to help the model learn from its mistakes and improve on subsequent attempts.
  4. Define the exception: If the model cannot find the information, we can define an explicit fallback in the prompt (for example, answering “I don’t know”) to keep the model from hallucinating.
  5. Output format design: A Pydantic model can be used to define the output format of the model’s response, which helps ensure the response is structured in a way that is easy to parse and use in downstream tasks. A clear output format also helps the model understand what type of response is expected and improves its performance on the task.

Data quality

Data preprocessing

Preprocessing the data can help improve the quality of the data and the performance of the RAG system. This can include tasks such as removing stop words, stemming or lemmatization, and removing special characters; it can also remove redundant information, especially if the data comes from a website. In this way, we reduce the noise in the data and improve the signal-to-noise ratio, which helps the model learn better and perform better on the task.
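A minimal preprocessing sketch, using a tiny illustrative stop-word list (a real system would use a fuller list, e.g. from NLTK):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of"}  # tiny illustrative stop list

def preprocess(text):
    # Lowercase, replace special characters with spaces, and drop stop
    # words to improve the signal-to-noise ratio before chunking.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(t for t in text.split() if t not in STOP_WORDS)
```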

Extend the query

  1. Extending the query with an LLM can improve the performance of the RAG system by giving the retriever more vectors to search with, covering more of the embedding space.
  2. Self-querying can help the model understand the query better. By allowing the model to reformulate the query, we help it capture the intent of the user and the key words needed for the search.

Context compression

Context compression is a technique used to reduce the amount of information that needs to be processed by the model while still retaining the most relevant and important information.

Evaluation

  1. LLM Evaluation: We can use LLM to evaluate the performance of the RAG system by providing it with a set of test queries and asking it to evaluate the relevance and quality of the retrieved results. This can be done by providing the LLM with a set of evaluation criteria, such as relevance, coherence, and informativeness, and asking it to score the retrieved results based on these criteria.
  2. Human Evaluation: We can also use human evaluation to assess the performance of the RAG system by asking human evaluators to rate the relevance and quality of the retrieved results. This can be done by providing the evaluators with a set of evaluation criteria and asking them to score the retrieved results based on these criteria. Human evaluation can provide valuable insights into the performance of the RAG system and can help identify areas for improvement.

Some common evaluation indicators:

  1. CR (Context Relevancy): Evaluates the retrieved context’s relevance to the query.
  2. AR (Answer Relevancy): Evaluates the answer’s relevance to the query.
  3. F (Faithfulness): Evaluates the answer’s faithfulness to the retrieved context.

We can define our own standard for these indicators and use them to evaluate the performance of the RAG system. For example, we can define a standard for CR as follows:
  • 0: The retrieved context is completely irrelevant to the query.
  • 1: The retrieved context is somewhat relevant to the query, but contains a significant amount of irrelevant information.
  • 2: The retrieved context is mostly relevant to the query, but contains some irrelevant information.
  • 3: The retrieved context is highly relevant to the query, with minimal irrelevant information.
  • 4: The retrieved context is completely relevant to the query, with no irrelevant information.
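Such a rubric can be turned directly into an LLM-judge prompt; the wording below is an illustrative sketch:

```python
CR_RUBRIC = {
    0: "completely irrelevant to the query",
    1: "somewhat relevant, with a significant amount of irrelevant information",
    2: "mostly relevant, with some irrelevant information",
    3: "highly relevant, with minimal irrelevant information",
    4: "completely relevant, with no irrelevant information",
}

def context_relevancy_prompt(query, context):
    # Build an LLM-judge prompt asking for a 0-4 CR score per the rubric.
    scale = "\n".join(f"{s}: {d}" for s, d in sorted(CR_RUBRIC.items()))
    return (
        "Rate how relevant the retrieved context is to the query "
        "on this scale:\n"
        f"{scale}\n\nQuery: {query}\nContext: {context}\n"
        "Answer with a single digit from 0 to 4."
    )
```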

Continuous evaluation

The RAG system needs continuous evaluation along with user feedback and monitoring of system performance. We have metrics like:

  • Precision: The percentage of retrieved results that are relevant to the query. This can be used to evaluate the accuracy of the retrieved results and the effectiveness of the retrieval process.
  • Recall: The percentage of relevant results that are retrieved by the system. This can be used to evaluate the completeness of the retrieved results and the effectiveness of the retrieval process.
  • CTR (Click-Through Rate): The percentage of retrieved results that are clicked by the user. This can be used to evaluate the relevance of the retrieved results to the user’s query.
  • Dwell Time: The amount of time a user spends on a retrieved result.
  • User Satisfaction: This can be measured through surveys or feedback forms to assess the user’s satisfaction with the retrieved results and the overall performance of the RAG system. User satisfaction can provide valuable insights into the effectiveness of the system and help identify areas for improvement.
  • A/B Testing: This involves comparing the performance of different versions of the RAG system by randomly assigning users to different groups and measuring their interactions with the system. A/B testing can help identify which version of the system performs better and can provide insights into user preferences and behavior.
  • Error Analysis: This involves analyzing the errors made by the RAG system to identify patterns and areas for improvement. By understanding the types of errors that occur, we can make targeted improvements to the system to enhance its performance.
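Precision and recall can be computed directly from the retrieved and relevant document sets:

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved items that are relevant.
    # Recall: fraction of relevant items that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 retrieved docs are relevant; 2 of the 3 relevant docs retrieved.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```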

Modular RAG system

A modular RAG system is a system that is designed to be flexible and adaptable, allowing for different components to be easily swapped out or modified without affecting the overall functionality of the system. This can be achieved through the use of APIs, microservices, or other modular design principles. See Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks.

Graph RAG

Knowledge Graph is popular in the RAG system as it can help to capture the relationships between different pieces of information and provide a more structured representation of the knowledge. See GraphRAG: Unlocking LLM discovery on narrative private data and From Local to Global: A Graph RAG Approach to Query-Focused Summarization.

Graph RAG first uses a knowledge graph to retrieve the entities and the relationships between them, then uses the retrieved information to generate a response.

Why use Graph RAG

  1. We can easily capture the relationships between different pieces of information.
  2. With the help of the knowledge graph, it can perform multi-step reasoning.
  3. We can also use the knowledge graph to illustrate the relationships between different pieces of information. This improves the explainability and traceability of the RAG system and helps users understand the retrieved information better.

What is knowledge graph

A knowledge graph often includes:

  1. Entities: The nodes in the graph that represent concepts, objects, or things. For example, in a knowledge graph about movies, entities could include actors, directors, and movie titles.
  2. Attributes: The properties or characteristics of the entities. For example, in a knowledge graph about movies, attributes could include the release date, genre, and box office revenue.
  3. Relationships: The connections between entities that represent the relationships between them. For example, in a knowledge graph about movies, relationships could include “acted in”, “directed by”, and “produced by”.

Steps to build a Graph RAG system

The GraphRAG Manifesto: Adding Knowledge to GenAI

  1. Entity Recognition: Identify key entities from text or data sources.
  2. Relation Extraction: Determine the relationships between entities, possibly using natural language processing techniques.
  3. Triple Generation: Represent entities and relationships in the form of (subject, predicate, object).
  4. Graph Storage: Use a graph database or a dedicated storage system to store the knowledge graph.
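The triple and retrieval steps above can be sketched as follows, with hand-written triples standing in for real entity recognition and relation extraction:

```python
# Hand-written triples of the form (subject, predicate, object); a real
# pipeline would produce these via entity recognition and relation extraction.
triples = [
    ("Christopher Nolan", "directed", "Inception"),
    ("Leonardo DiCaprio", "acted in", "Inception"),
]

def neighbors(entity, triples):
    # Retrieve every triple mentioning the entity -- the basic graph lookup
    # a Graph RAG retriever performs before generation.
    return [t for t in triples if entity in (t[0], t[2])]
```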