In the previous chapter, we learned how to process data and store embeddings in a vector store. This chapter focuses on efficiently retrieving the most relevant embeddings and document chunks in response to a user query. These retrieved documents are added as context to the prompt, improving the accuracy of the LLM’s output.
This workflow—embedding a user query, retrieving similar documents, and passing them to the LLM as context—is known as retrieval-augmented generation (RAG).
RAG is a core technique for building accurate, efficient, and up-to-date chat-enabled LLM applications. This chapter introduces both foundational and advanced RAG strategies, covering different data sources (such as vector stores and databases) and data types (structured and unstructured). We begin by defining RAG and outlining its key benefits.
Introducing Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) improves the accuracy of LLM outputs by providing relevant context from external sources. The term was introduced by Meta AI researchers, who showed that RAG-enabled models produce more factual and specific responses than models relying solely on pretrained data.
Without RAG, an LLM depends entirely on its training data, which may be outdated. For example, when asked who most recently won the men’s FIFA World Cup, an LLM may incorrectly answer “France (2018)” instead of the correct and more recent winner, Argentina (2022). While this mistake is harmless in a simple example, similar hallucinations can be risky in real-world decision-making.
RAG addresses this issue by supplying the LLM with up-to-date, factual context. When relevant information—such as a paragraph from Wikipedia stating that Argentina won the 2022 World Cup—is added to the prompt, the model can generate a correct answer.
Manually copying and pasting context, however, is neither practical nor scalable. A production RAG system automates this process by retrieving relevant information based on a user’s query, appending it as context, and then generating a response from the LLM.
Retrieving Relevant Documents
A RAG system typically consists of three stages: indexing, retrieval, and generation. Indexing preprocesses data and stores embeddings in a vector store, retrieval fetches relevant document chunks for a user query, and generation combines those chunks with the prompt sent to the model.
Indexing was covered in Chapter 2; the example below briefly recaps it before moving on to retrieval:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres.vectorstores import PGVector
# Load the document, split it into chunks
raw_documents = TextLoader('./test.txt').load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
    chunk_overlap=200)
documents = text_splitter.split_documents(raw_documents)
# embed each chunk and insert it into the vector store
model = OpenAIEmbeddings()
connection = 'postgresql+psycopg://langchain:langchain@localhost:6024/langchain'
db = PGVector.from_documents(documents, model, connection=connection)
Once indexing is complete, the retrieval stage performs similarity searches between the user’s query and stored embeddings to identify relevant document chunks.
Retrieval involves:
- Embedding the user query
- Finding the most similar embeddings in the vector store
- Returning the corresponding text chunks
This can be implemented as follows:
# create retriever
retriever = db.as_retriever()
# fetch relevant documents
docs = retriever.invoke("""Who are the key figures in the ancient greek
history of philosophy?""")
The as_retriever method abstracts query embedding and similarity search. You can also control how many documents are returned using the k parameter:
# create retriever with k=2
retriever = db.as_retriever(search_kwargs={"k": 2})
# fetch the 2 most relevant documents
docs = retriever.invoke("""Who are the key figures in the ancient greek history
of philosophy?""")
Using a smaller k often improves performance and reduces cost while minimizing irrelevant context that can lead to hallucinations. With retrieval complete, the system is ready for the final generation stage.
Generating LLM Predictions Using Relevant Documents
After retrieving relevant documents for a user query, the final step in a RAG pipeline is to include those documents as context in the prompt and invoke the LLM to generate a response (Figure 1).

Below is a continuation of the previous example, showing how retrieved documents are injected into the prompt and passed to the model:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
retriever = db.as_retriever()
prompt = ChatPromptTemplate.from_template("""Answer the question based only on
the following context:
{context}
Question: {question}
""")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = prompt | llm
# fetch relevant documents
docs = retriever.invoke("""Who are the key figures in the
ancient greek history of philosophy?""")
# run
chain.invoke({
    "context": docs,
    "question": "Who are the key figures in the ancient greek history of philosophy?",
})
Here, the prompt template uses dynamic variables for context and question, the LLM is configured with deterministic output (temperature=0), and the prompt and model are composed into a chain.
This logic can be encapsulated into a single reusable function:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain
retriever = db.as_retriever()
prompt = ChatPromptTemplate.from_template("""Answer the question based only on
the following context:
{context}
Question: {question}
""")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
@chain
def qa(input):
    # fetch relevant documents
    docs = retriever.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer
# run
qa.invoke("Who are the key figures in the ancient greek history of philosophy?")
The @chain decorator turns the function into a runnable pipeline that retrieves documents, formats the prompt, and generates an answer in one step.
You can also return the retrieved documents alongside the answer for inspection or debugging:
@chain
def qa(input):
    # fetch relevant documents
    docs = retriever.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return {"answer": answer, "docs": docs}
At this point, you’ve built a basic RAG system capable of answering questions using external data. In the next section, we explore advanced, research-backed strategies for making RAG systems robust and production-ready.
Query Transformation
A basic RAG system depends heavily on the quality of a user’s query. In production, queries are often ambiguous, incomplete, or poorly worded, which can lead to irrelevant retrieval and hallucinated responses.
Query transformation techniques address this issue by modifying the user’s input before retrieval. These strategies vary in how much they abstract or refine the original query (Figure 2). We begin with a mid-level approach: Rewrite-Retrieve-Read.

Rewrite-Retrieve-Read
The Rewrite-Retrieve-Read strategy prompts an LLM to rewrite the user’s query into a clearer, more focused form before retrieval. This helps the retriever ignore irrelevant details and fetch more useful context.
Below is the same chain from the previous section, invoked with a poorly worded query:
@chain
def qa(input):
    # fetch relevant documents
    docs = retriever.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer
qa.invoke("""Today I woke up and brushed my teeth, then I sat down to read the
news. But then I forgot the food on the cooker. Who are some key figures in
the ancient greek history of philosophy?""")
Because the query contains irrelevant information, the retriever fails to return useful context.
Now let’s apply query rewriting before retrieval:
rewrite_prompt = ChatPromptTemplate.from_template("""Provide a better search
query for web search engine to answer the given question, end the queries
with '**'. Question: {x} Answer:""")
def parse_rewriter_output(message):
    return message.content.strip('"').strip("**")
rewriter = rewrite_prompt | llm | parse_rewriter_output
@chain
def qa_rrr(input):
    # rewrite the query
    new_query = rewriter.invoke(input)
    # fetch relevant documents
    docs = retriever.invoke(new_query)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer
# run
qa_rrr.invoke("""Today I woke up and brushed my teeth, then I sat down to read
the news. But then I forgot the food on the cooker. Who are some key
figures in the ancient greek history of philosophy?""")
In this approach, the LLM rewrites the noisy user input into a focused search query, which is then used for retrieval. This results in more relevant documents and a more accurate final answer.
While Rewrite-Retrieve-Read can significantly improve retrieval quality, it introduces additional latency because it requires two sequential LLM calls. Despite this tradeoff, it is a powerful and flexible technique that works with vector stores, web search, and other retrieval methods.
Multi-Query Retrieval
A single user query may not fully capture all the information needed for a complete answer. Multi-query retrieval addresses this by having an LLM generate multiple variations of the original question, retrieving documents for each query in parallel, and combining the results as context for generation (Figure 3).

This approach is especially useful when answering questions that benefit from multiple perspectives.
Below is an example implementation. First, we prompt the LLM to generate alternative queries:
from langchain_core.prompts import ChatPromptTemplate
perspectives_prompt = ChatPromptTemplate.from_template("""You are an AI language
model assistant. Your task is to generate five different versions of the
given user question to retrieve relevant documents from a vector database.
By generating multiple perspectives on the user question, your goal is to
help the user overcome some of the limitations of the distance-based
similarity search. Provide these alternative questions separated by
newlines. Original question: {question}""")
def parse_queries_output(message):
    return message.content.split('\n')
query_gen = perspectives_prompt | llm | parse_queries_output
Next, we retrieve documents for each generated query in parallel and deduplicate the results:
def get_unique_union(document_lists):
    # Flatten list of lists, and dedupe them
    deduped_docs = {
        doc.page_content: doc
        for sublist in document_lists for doc in sublist
    }
    # return a flat list of unique docs
    return list(deduped_docs.values())
retrieval_chain = query_gen | retriever.batch | get_unique_union
Because multiple related queries often return overlapping results, deduplication ensures each document appears only once. The .batch method enables parallel retrieval, improving performance.
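The deduplication step can be exercised on its own. In the sketch below, a minimal FakeDoc dataclass (a hypothetical stand-in for LangChain’s Document, used only to keep the example self-contained) shows how overlapping results from two queries collapse into a unique set:

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    # Hypothetical stand-in for LangChain's Document, just enough
    # to demonstrate deduplication by page_content
    page_content: str

def get_unique_union(document_lists):
    # Flatten the list of lists and dedupe by content: duplicate
    # contents map to the same dictionary key, so each appears once
    deduped_docs = {
        doc.page_content: doc
        for sublist in document_lists for doc in sublist
    }
    return list(deduped_docs.values())

# Two related queries returned overlapping results
results = [
    [FakeDoc("Socrates taught Plato."), FakeDoc("Plato founded the Academy.")],
    [FakeDoc("Plato founded the Academy."), FakeDoc("Aristotle tutored Alexander.")],
]
unique = get_unique_union(results)
print(len(unique))  # 3 unique documents
```

The shared chunk about Plato appears in both result lists but only once in the union, so the model never sees the same context twice.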
Finally, we construct a prompt using the combined retrieved documents and generate the answer:
prompt = ChatPromptTemplate.from_template("""Answer the following question based
on this context:
{context}
Question: {question}
""")
@chain
def multi_query_qa(input):
    # fetch relevant documents
    docs = retrieval_chain.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer
# run
multi_query_qa.invoke("""Who are some key figures in the ancient greek history
of philosophy?""")
This pattern closely mirrors earlier QA chains, with the added complexity isolated inside retrieval_chain. Encapsulating each retrieval strategy this way makes advanced RAG techniques easy to reuse and combine.
RAG-Fusion
RAG-Fusion extends multi-query retrieval by adding a final reranking step using reciprocal rank fusion (RRF). Instead of simply merging retrieved documents, RRF combines rankings from multiple queries into a single unified ranking, promoting documents that consistently appear near the top across different query perspectives. This makes it effective for aggregating results with different score distributions.
We start by generating multiple search queries from the user’s input:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
prompt_rag_fusion = ChatPromptTemplate.from_template("""You are a helpful
assistant that generates multiple search queries based on a single input
query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):""")
def parse_queries_output(message):
    return message.content.split('\n')
llm = ChatOpenAI(temperature=0)
query_gen = prompt_rag_fusion | llm | parse_queries_output
Next, we retrieve documents for each generated query and rerank them using reciprocal rank fusion. RRF assigns each document a score based on its rank across multiple result lists and produces a single reranked list:
def reciprocal_rank_fusion(results: list[list], k=60):
    """Reciprocal rank fusion on multiple lists of ranked documents,
    with an optional parameter k used in the RRF formula.
    """
    # Fused scores for each document, keyed by contents to ensure uniqueness
    fused_scores = {}
    documents = {}
    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Use the document contents as the key for uniqueness
            doc_str = doc.page_content
            # First time we see this document: initialize its score
            # and save it for later
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
                documents[doc_str] = doc
            # Update the score of the document using the RRF formula:
            # 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)
    # Sort the documents by fused score in descending order
    # to get the final reranked results
    reranked_doc_strs = sorted(
        fused_scores, key=lambda d: fused_scores[d], reverse=True
    )
    # retrieve the corresponding doc for each doc_str
    return [documents[doc_str] for doc_str in reranked_doc_strs]
retrieval_chain = query_gen | retriever.batch | reciprocal_rank_fusion
The parameter k controls how much influence lower-ranked documents have on the final ranking: higher values give them more weight.
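A small worked example makes the effect of k concrete. Plain strings stand in for documents here, and the helper applies the same 1 / (rank + k) formula as reciprocal_rank_fusion:

```python
def rrf_scores(ranked_lists, k=60):
    # Accumulate 1 / (rank + k) for every appearance of a document
    scores = {}
    for docs in ranked_lists:
        for rank, doc in enumerate(docs):
            scores[doc] = scores.get(doc, 0) + 1 / (rank + k)
    return scores

# "B" is ranked second in all three lists; "C" tops two lists; "A" tops one
lists = [["A", "B"], ["C", "B"], ["C", "B"]]

for k in (1, 60):
    scores = rrf_scores(lists, k=k)
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(k, ranking)
# k=1 heavily rewards rank-0 appearances, so "C" comes first;
# k=60 flattens rank differences, so the consistent "B" comes first
```

With k=1, a single top rank contributes 1.0 while a second-place rank contributes only 0.5, so "C" wins; with k=60, ranks 0 and 1 contribute nearly equally (1/60 vs. 1/61), so "B", which appears in every list, accumulates the highest fused score.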
Finally, we combine the RRF-based retrieval chain with the standard generation step:
prompt = ChatPromptTemplate.from_template("""Answer the following question based
on this context:
{context}
Question: {question}
""")
llm = ChatOpenAI(temperature=0)
@chain
def multi_query_qa(input):
    # fetch relevant documents
    docs = retrieval_chain.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer
multi_query_qa.invoke("""Who are some key figures in the ancient greek history
of philosophy?""")
RAG-Fusion excels at handling complex queries by broadening retrieval, reranking results intelligently, and surfacing the most consistently relevant documents—often enabling richer and more serendipitous answers.
Hypothetical Document Embeddings (HyDE)
Hypothetical Document Embeddings (HyDE) improve retrieval by first having an LLM generate a hypothetical document that answers the user’s question. This document is then embedded and used for similarity search. Because the generated passage is closer in meaning to relevant source documents than the raw query, retrieval quality often improves (Figure 4).

We begin by prompting the LLM to generate a hypothetical document:
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
prompt_hyde = ChatPromptTemplate.from_template("""Please write a passage to
answer the question.\n Question: {question} \n Passage:""")
generate_doc = (
prompt_hyde | ChatOpenAI(temperature=0) | StrOutputParser()
)
Next, we embed the hypothetical document and retrieve similar documents from the vector store:
retrieval_chain = generate_doc | retriever
Finally, the retrieved documents are added as context to the prompt and passed to the model to generate the final answer:
prompt = ChatPromptTemplate.from_template("""Answer the following question based
on this context:
{context}
Question: {question}
""")
llm = ChatOpenAI(temperature=0)
@chain
def qa(input):
    # fetch relevant documents from the hyde retrieval chain defined earlier
    docs = retrieval_chain.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer
qa.invoke("""Who are some key figures in the ancient greek history of
philosophy?""")
Query Routing
In production RAG systems, data may reside in multiple sources, such as different vector stores or databases. Query routing forwards a user’s query to the most appropriate data source.
Logical Routing
Logical routing gives the LLM knowledge of the available data sources and lets it reason about which source to use for a given query. Function-calling models like GPT-3.5 Turbo can output structured results to select the correct route.
from typing import Literal
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
# Data model
class RouteQuery(BaseModel):
    """Route a user query to the most relevant datasource."""
    datasource: Literal["python_docs", "js_docs"] = Field(
        ...,
        description="""Given a user question, choose which datasource would be
most relevant for answering their question""",
    )
# LLM with function call
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
structured_llm = llm.with_structured_output(RouteQuery)
# Prompt
system = """You are an expert at routing a user question to the appropriate data
source. Based on the programming language the question is referring to, route it to
the relevant data source."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
# Define router
router = prompt | structured_llm
Invoke the LLM to extract the relevant data source:
question = """Why doesn't the following code work:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages(["human", "speak in {language}"])
prompt.invoke("french")
"""
result = router.invoke({"question": question})
result.datasource
# "python_docs"
Finally, route the query to the appropriate chain:
def choose_route(result):
    if "python_docs" in result.datasource.lower():
        ### Logic here
        return "chain for python_docs"
    else:
        ### Logic here
        return "chain for js_docs"
full_chain = router | RunnableLambda(choose_route)
This approach makes the system resilient to minor deviations in LLM output (e.g., casing or extra characters) and allows logical routing across multiple vector stores, databases, or APIs.
Semantic Routing
Semantic routing selects the most relevant data source by embedding representations of different prompts or sources and performing vector similarity search with the user’s query. The prompt most similar to the query is then used to guide the LLM.
from langchain.utils.math import cosine_similarity
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import chain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Two prompts representing different domains
physics_template = """You are a very smart physics professor. You are great at
answering questions about physics in a concise and easy-to-understand manner.
When you don't know the answer to a question, you admit that you don't know.
Here is a question:
{query}"""
math_template = """You are a very good mathematician. You are great at answering
math questions. You are so good because you are able to break down hard
problems into their component parts, answer the component parts, and then
put them together to answer the broader question.
Here is a question:
{query}"""
# Embed prompts
embeddings = OpenAIEmbeddings()
prompt_templates = [physics_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)
# Route question to the most relevant prompt
@chain
def prompt_router(query):
    # Embed question
    query_embedding = embeddings.embed_query(query)
    # Compute similarity
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]
    # Pick the prompt most similar to the input question
    most_similar = prompt_templates[similarity.argmax()]
    return PromptTemplate.from_template(most_similar)
semantic_router = (
prompt_router
| ChatOpenAI()
| StrOutputParser()
)
print(semantic_router.invoke("What's a black hole"))
This method ensures the user’s query is routed to the most contextually relevant prompt before invoking the LLM.
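The routing mechanism itself is easy to see with toy vectors in place of real embeddings (the three-dimensional vectors and the hand-rolled cosine_similarity below are illustrative only, standing in for OpenAIEmbeddings output):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" for the two prompt templates
prompt_embeddings = {
    "physics": [0.9, 0.1, 0.0],
    "math": [0.1, 0.9, 0.0],
}

def route(query_embedding):
    # Pick the prompt whose embedding is most similar to the query's
    return max(
        prompt_embeddings,
        key=lambda name: cosine_similarity(query_embedding, prompt_embeddings[name]),
    )

print(route([0.8, 0.2, 0.1]))  # physics
print(route([0.2, 0.8, 0.0]))  # math
```

A query whose embedding leans toward the physics template is routed there, exactly as prompt_router does with real embeddings.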
Query Construction
RAG is powerful for unstructured data, but production data often contains structured fields (e.g., in relational databases) or metadata attached to vector embeddings. Query construction converts a natural language query into the appropriate query format for structured or metadata-based retrieval.
Text-to-Metadata Filter
Vector stores allow filtering based on metadata. LangChain’s SelfQueryRetriever leverages an LLM to extract structured filters from a user’s query and combine them with semantic search:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI
fields = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
description = "Brief summary of a movie"
llm = ChatOpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm, db, description, fields,
)
print(retriever.invoke(
    "What's a highly rated (above 8.5) science fiction film?"
))
This retriever automatically:
- Generates a metadata filter from the user’s query.
- Produces a rewritten query for semantic search.
- Applies the filter to the documents’ metadata.
- Runs a similarity search over the documents that pass the filter.
This approach ensures structured fields and semantic search work together for accurate RAG results.
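Conceptually, the retriever splits the question into a semantic query plus a metadata filter. The plain-Python sketch below mimics that filtering step on toy records; the dict shapes and field names are illustrative only, not the retriever’s actual internal format:

```python
# Roughly what the LLM extracts from the natural-language question
structured_query = {
    "query": "science fiction film",    # rewritten semantic query
    "filter": {"rating": (">", 8.5)},   # metadata filter
}

# Toy metadata records standing in for the vector store's documents
movies = [
    {"title": "Arrival", "genre": "science fiction", "rating": 9.0},
    {"title": "Gravity", "genre": "science fiction", "rating": 7.9},
    {"title": "Amelie", "genre": "romance", "rating": 8.9},
]

def apply_filter(records, flt):
    # Keep only records whose metadata satisfies every condition
    out = []
    for rec in records:
        ok = all(
            rec[field] > value if op == ">" else rec[field] == value
            for field, (op, value) in flt.items()
        )
        if ok:
            out.append(rec)
    return out

candidates = apply_filter(movies, structured_query["filter"])
print([m["title"] for m in candidates])  # ['Arrival', 'Amelie']
```

The rating filter removes Gravity; the subsequent semantic search over the remaining candidates would then rank Arrival above Amelie, since the rewritten query mentions science fiction.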
Text-to-SQL
Relational databases require structured SQL queries, which don’t naturally align with user questions. RAG systems can leverage LLMs to generate SQL safely and accurately. Key strategies include:
- Database description: Provide the LLM with table schemas (CREATE TABLE statements) and optionally a few example rows.
- Few-shot examples: Include sample question-to-SQL mappings in the prompt to guide the LLM.
Python Example
from langchain_community.tools import QuerySQLDatabaseTool
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI
# Connect to your database
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-4", temperature=0)
# Convert question to SQL query
write_query = create_sql_query_chain(llm, db)
# Execute SQL query
execute_query = QuerySQLDatabaseTool(db=db)
# Combine query generation and execution
chain = write_query | execute_query
# Invoke the chain
chain.invoke({"question": "How many employees are there?"})
Security considerations for production:
- Use a read-only database user.
- Limit accessible tables to only those you want to expose.
- Add query time-outs to prevent expensive queries from overloading resources.
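The read-only recommendation is easy to demonstrate with SQLite, which supports read-only connections via a URI flag (the file and table names below are arbitrary):

```python
import os
import sqlite3
import tempfile

# Create a throwaway database with one table
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.execute("INSERT INTO employees VALUES (1, 'Ada')")
conn.commit()
conn.close()

# Reopen in read-only mode: SELECTs work, writes are rejected
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
count = ro.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(count)  # 1

try:
    ro.execute("DELETE FROM employees")
except sqlite3.OperationalError as e:
    print("write blocked:", e)
ro.close()
```

With a connection like this, even a hallucinated DELETE or DROP generated by the LLM fails at the database layer rather than destroying data; most production databases offer the equivalent through a dedicated read-only user.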
This approach enables natural language queries to safely retrieve structured data from databases.
Summary
This chapter covered state-of-the-art retrieval-augmented generation (RAG) techniques to improve the accuracy and relevance of LLM outputs by efficiently fetching and synthesizing relevant documents.
Key takeaways:
- Query Transformation
- Rewrite, decompose, or expand user queries to improve retrieval quality.
- Techniques: Rewrite-Retrieve-Read (RRR), Multi-Query Retrieval, RAG-Fusion, HyDE.
- Query Construction
- Convert natural language queries into structured queries for databases or metadata filters.
- Tools: SelfQueryRetriever for vector stores, Text-to-SQL pipelines for relational databases.
- Query Routing
- Dynamically select the appropriate data source for a query.
- Methods: Logical routing with structured LLM outputs, Semantic routing with embedding similarity.
- Integration in Python
- Use LangChain’s ChatOpenAI, retrievers, prompt templates, and chains to fetch relevant documents and generate responses.
- Combine retrieval, reranking, and LLM output generation in a single pipeline.
A robust production-ready RAG system combines these strategies with indexing and optimization to provide accurate, up-to-date answers.
Next step (Chapter 4): Add memory to your AI chatbot to support multi-turn conversations.