Building a Basic RAG System: From Parsing to Answer Generation
Introduction#
RAG (Retrieval-Augmented Generation) is a technique that expands the capabilities of language models (LLMs) by “connecting” them to a knowledge base of any size. It enriches model responses with relevant information from documents, overcoming context-window limitations and keeping answers grounded in current data.
Key Stages of RAG System Development#
The development path of a basic RAG system includes four key stages:
- Parsing: Preparing data for the knowledge base: collecting documents, converting them to text format, and cleaning them of redundant information.
- Ingestion: Creating and populating the knowledge base with structured information from processed documents.
- Retrieval: Developing a mechanism that finds and returns relevant data in response to a user query. Typically involves semantic search across a vector database.
- Answering: Enriching the user’s question prompt with retrieved data, sending it to an LLM, and generating the final response.
```mermaid
flowchart TB
    %% Ingestion pipeline
    subgraph Ingestion ["Ingestion Pipeline"]
        direction LR
        PDF["PDF/DOC/TXT/MD Files"]:::box
        TEXT["Text Cleaning & Chunking"]:::box
        VDB1[("Vector Databases")]:::storage
        PDF -->|parsing text| TEXT
        TEXT -->|embed & store| VDB1
    end

    %% Answering pipeline
    subgraph Answering ["Answering Pipeline"]
        direction TB
        Question["User Question"]:::focused

        %% Retrieval branch
        subgraph Retrieval ["Retrieval Process"]
            direction TB
            VDB2[("Vector Database")]:::storage
            Rerank{"LLM Re-ranking"}:::box
            Context["Relevant Context"]:::box
            Question -->|query embeddings| VDB2
            VDB2 -->|retrieve candidates| Rerank
            Rerank -->|select top results| Context
        end

        %% Prompt branch
        subgraph Prompting ["Prompt Generation"]
            direction TB
            PromptColl["Prompt Collection"]:::box
            PromptTpl["Prompt Template"]:::box
            Question -.->|routing to relevant prompt| PromptColl
            PromptColl -->|select prompt template| PromptTpl
        end

        %% Final assembly
        Request["LLM Request"]:::box
        Answer["Final Answer"]:::focused
        Context -->|provide context| Request
        PromptTpl -->|structure query| Request
        Question -->|pass-through| Request
        Request -->|generate response| Answer
    end

    %% Cross-system routing
    VDB1 -.->|route to relevant database| VDB2
```
The diagram above summarizes the RAG pipeline. The Ingestion pipeline shows how raw files are parsed, cleaned, and chunked into text before being embedded into a vector database. The Answering pipeline shows how a user’s question triggers the retrieval of relevant context (with optional reranking), the selection of a prompt template, and the final LLM request that produces the answer.
1. Parsing: Overcoming Conversion Challenges#
The first step is to convert source documents (PDFs, Word docs, etc.) into a clean text format. Parsing PDF documents is a non-trivial task with multiple technical challenges:
- Preserving table structures
- Correct processing of formatting elements (headings, lists)
- Recognition of multi-column text
- Processing charts, images, formulas, headers and footers
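A minimal parsing sketch is shown below, using the pypdf library as one of many possible choices (the article does not prescribe a specific parser). It extracts plain text page by page and applies only a light cleanup pass; tables, multi-column layouts, and formulas generally require more specialized tooling.

```python
# Hedged sketch: extract per-page text from a PDF with pypdf and lightly clean it.
import re
from pypdf import PdfReader

def pdf_to_pages(path: str) -> list[str]:
    reader = PdfReader(path)
    pages = []
    for page in reader.pages:
        text = page.extract_text() or ""
        text = re.sub(r"[ \t]+", " ", text)     # collapse repeated spaces
        text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
        pages.append(text.strip())
    return pages
```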
2. Ingestion: Creating a Knowledge Base#
After the PDFs have been converted to text (Markdown) and cleaned, the next step is to create the databases used for storing and searching the information.
Chunking Strategy#
The simplest approach is to use a document page as the unit of information storage (chunk). However, to improve search accuracy, it’s advisable to use smaller semantic units. Practice shows that a chunk of 8–10 sentences (200–500 words) is usually enough to form a complete answer, and smaller chunks increase relevance during search.
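A simple chunker along these lines might group sentences until a target word budget is reached (roughly the 200–500-word range mentioned above). The regex-based sentence splitting here is an illustrative simplification; a tokenizer or a library splitter usually works better in practice.

```python
# Hedged sketch: group sentences into chunks of roughly max_words words.
import re

def chunk_page(text: str, max_words: int = 300) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```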
Vectorization#
For efficient searching, instead of creating one general database, a separate vector database is created for each document (for example, for each company report). This prevents mixing information about different companies and simplifies finding the necessary data in the context of a specific document.
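A minimal vectorization sketch is shown below: one FAISS index per document, as described above. The embedding model name and the choice of a flat inner-product index are illustrative assumptions, not requirements.

```python
# Hedged sketch: embed chunks with sentence-transformers and store them in FAISS.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    embeddings = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
    index.add(embeddings)
    return index

# One index per company report keeps documents from mixing, e.g.:
# indexes["acme_2023"] = build_index(acme_2023_chunks)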
3. Retrieval: Finding Relevant Information#
The Retriever is a search system that takes a query and returns relevant text containing information for the answer. The quality of this component is critically important: if the necessary information doesn’t make it into the LLM’s context, a quality answer is impossible ("junk in, junk out").
```mermaid
flowchart LR
    %% Nodes
    Question["Question"]:::focused
    EmbModel["Embedding model"]:::box
    VectorDB["Vector Database"]:::db
    ParentPage["Parent Page Retrieval"]:::box
    Reranking["LLM Reranking"]:::box
    Context["Context for answering"]:::focused

    %% Connections with descriptive labels
    Question -->|"What is the Capital of France?"| Process
    EmbModel -->|"[Vector]"| VectorDB
    VectorDB -->|"Top N chunks"| ParentPage
    ParentPage -->|"Top N pages"| Reranking
    Process -->|"Top N pages"| Context

    %% Subgraph to organize flow
    subgraph Process ["Retrieval Process"]
        EmbModel
        VectorDB
        ParentPage
        Reranking
    end
```
In the flowchart above, the user’s question is embedded and used to query the vector store. The system retrieves the top-N similar chunks (for example, N=30). These chunks are then traced back to their parent pages or documents. A re-ranking step (such as a cross-encoder or even an LLM-based scorer) is applied to these pages, and the highest-ranked ones are selected. The final output is the top-K pages (e.g. K=10) of context, combined into a unified block for the LLM. This context, along with the question, forms the prompt for answer generation.
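The query side of this flow can be sketched as follows, continuing from the vectorization example above (the `chunk_meta` structure mapping chunk ids to page numbers is an assumption about how metadata is stored at ingestion).

```python
# Hedged sketch: embed the question, fetch the top-N chunks, map hits to parent pages.
import numpy as np

def retrieve_chunks(question: str, index, chunk_meta: list[dict], top_n: int = 30) -> list[dict]:
    q = model.encode([question], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), top_n)
    return [
        {"chunk_id": int(i), "page": chunk_meta[int(i)]["page"], "score": float(s)}
        for s, i in zip(scores[0], ids[0])
    ]
```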
Methods for Improving Search Quality#
Hybrid Search: Vector Search + BM25
Hybrid search combines semantic vector-based search with classic BM25 (Best Match 25) keyword-based text search. It takes into account both the meaning of the text and exact matches of words from the query. Results from both methods are combined and ranked according to a composite relevance score.
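One way to sketch this blending, reusing the embedding model and index from the earlier examples: BM25 scores from the rank_bm25 library are min-max normalized and mixed with the vector-similarity scores. The 50/50 weighting and the normalization scheme are illustrative choices, not prescriptions.

```python
# Hedged sketch: composite score = alpha * vector score + (1 - alpha) * BM25 score.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(question: str, chunks: list[str], index, top_n: int = 30, alpha: float = 0.5):
    # Vector-similarity scores for every chunk
    q = model.encode([question], normalize_embeddings=True)
    vec_scores, ids = index.search(np.asarray(q, dtype="float32"), len(chunks))
    vec = np.zeros(len(chunks))
    vec[ids[0]] = vec_scores[0]

    # BM25 keyword scores
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    kw = np.array(bm25.get_scores(question.lower().split()))

    # Min-max normalize both score lists and blend them
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-9)
    combined = alpha * norm(vec) + (1 - alpha) * norm(kw)
    return combined.argsort()[::-1][:top_n]  # indices of the best chunks
```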
Cross-encoder Reranking
Reranking vector search results with a cross-encoder model provides a more accurate relevance assessment. Unlike comparison via precomputed vector representations, where some information is lost during vectorization, a cross-encoder evaluates the semantic similarity of the question and the text jointly, which yields a more precise relevance estimate.
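A cross-encoder reranking pass might look like the sketch below; the specific model name is an assumption.

```python
# Hedged sketch: score (question, passage) pairs jointly and keep the best ones.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model

def rerank(question: str, passages: list[str], top_k: int = 10) -> list[str]:
    scores = reranker.predict([(question, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```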
LLM Reranking
This approach uses a language model to evaluate relevance. The text and question are sent to an LLM with a request to assess the usefulness of the text for answering on a scale from 0 to 1. Previously, this approach was impractical due to the high cost of quality LLMs, but with the emergence of fast and efficient models, it has become a practical solution.
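A minimal LLM-reranking sketch using the OpenAI SDK is shown below; the model name and prompt wording are assumptions, and any chat-capable model could be substituted.

```python
# Hedged sketch: ask an LLM for a 0-1 usefulness score for one page of text.
from openai import OpenAI

client = OpenAI()

def llm_relevance(question: str, page_text: str) -> float:
    prompt = (
        "Rate how useful the following text is for answering the question, "
        "as a number between 0 and 1. Respond with the number only.\n\n"
        f"Question: {question}\n\nText: {page_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # fall back to "not relevant" if the model returns something unexpected
```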
Parent Page Retrieval
After finding the most relevant chunks, this method uses them only as pointers to pages that then go into the query context. This allows including not only the directly relevant fragment but also useful surrounding information from the page.
Architecture of the Final Retriever#
- Vectorizing the user’s question
- Searching for the top 30 relevant chunks based on the question vector
- Extracting corresponding pages using chunk metadata (with deduplication)
- Passing pages to the LLM reranker to refine relevance assessment
- Adjusting the relevance score of pages
- Returning the top 10 pages with page number information and combining them into a unified context
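Tying the steps above together, a final retriever might look like the sketch below. It reuses the hypothetical `retrieve_chunks` and `llm_relevance` helpers from the earlier sketches, and assumes a `pages` mapping from page number to page text built at ingestion.

```python
# Hedged sketch: vector search -> parent pages (deduplicated) -> LLM reranking
# -> top-10 pages merged into a single context string.
def retrieve_context(question: str, index, chunk_meta: list[dict], pages: dict[int, str],
                     top_chunks: int = 30, top_pages: int = 10) -> str:
    hits = retrieve_chunks(question, index, chunk_meta, top_n=top_chunks)
    page_ids = list(dict.fromkeys(h["page"] for h in hits))          # dedupe, keep order
    scored = [(pid, llm_relevance(question, pages[pid])) for pid in page_ids]
    scored.sort(key=lambda x: x[1], reverse=True)                    # adjust ranking by LLM score
    best = scored[:top_pages]
    return "\n\n".join(f"[page {pid}]\n{pages[pid]}" for pid, _ in best)
```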
4. Answering: Generating the Response#
In the final stage, relevant information extracted during the Retrieval phase is combined with the user’s question and sent to the language model. The LLM analyzes the provided information and generates the final answer based on relevant data from the documents.
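In code, this final step can be as simple as the sketch below: the retrieved context and the question are combined into one prompt and sent to the LLM. The prompt wording and model name are illustrative.

```python
# Hedged sketch: build the answering prompt from context + question and call the LLM.
from openai import OpenAI

client = OpenAI()

def answer(question: str, context: str) -> str:
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```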
Methods for Improving Answer Quality#
Routing Multi-component Queries
When working with complex queries requiring information from different sources or documents, an effective solution is to decompose them into sub-queries. The system analyzes the original question, breaks it down into components, directs each to the appropriate knowledge base, and then combines the results to form a comprehensive answer.
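One way to sketch the decomposition step: the LLM splits a multi-part question into sub-questions returned as a JSON list, each sub-question is then answered against its own knowledge base, and the partial answers are merged. The prompt wording and model name are assumptions.

```python
# Hedged sketch: ask the LLM to break a question into independent sub-questions.
import json
from openai import OpenAI

client = OpenAI()

def decompose(question: str) -> list[str]:
    prompt = (
        "Break the question into independent sub-questions, one per data source. "
        "Respond with a JSON array of strings only.\n\nQuestion: " + question
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)  # may need retry if output isn't valid JSON
```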
Chain of Thought (CoT)
CoT is a way to significantly increase the quality of answers by making the model “reason” before producing the final result. Instead of giving an answer immediately, the LLM generates a chain of intermediate steps that help arrive at the answer. This approach is particularly effective for complex, multi-stage queries requiring logical reasoning.
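A minimal CoT-style prompt template might look like this; the exact wording is illustrative.

```python
# Hedged sketch: ask the model to reason step by step before the final answer.
COT_PROMPT = (
    "Answer the question using the context below.\n"
    "First, think step by step: list the relevant facts from the context and "
    "explain how they combine.\n"
    "Then write the final answer on a new line starting with 'Answer:'.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
```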
Structured Outputs
SO (Structured Outputs) is the ability to specify a strict response format for the model. The schema is usually passed to the API as a separate field, as a Pydantic model or a JSON Schema. This method forces the model to always respond with valid JSON that fully corresponds to the specified format, which significantly simplifies subsequent processing and integration of responses into other systems.
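A structured-output sketch is shown below: a Pydantic model defines the response schema and is passed to the OpenAI SDK's parse helper. The `parse` method, the model name, and the example question (about a hypothetical "Acme Corp") are assumptions about the SDK version and data in use.

```python
# Hedged sketch: enforce a JSON response shape via a Pydantic schema.
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class RAGAnswer(BaseModel):
    step_by_step_reasoning: str
    relevant_pages: list[int]
    final_answer: str

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "What was Acme Corp's 2023 revenue?"}],  # hypothetical question
    response_format=RAGAnswer,  # the SDK converts the Pydantic model to a JSON schema
)
parsed_answer: RAGAnswer = completion.choices[0].message.parsed
```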
One-shot Prompts
This is another common technique that is quite simple and effective: if, in addition to instructions, you add an example answer to the prompt, the quality and consistency of responses will significantly improve. The model receives a clear example of what format and style of answer is expected, which helps it better meet the requirements of the task.
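A one-shot prompt simply embeds a single worked example so the model mirrors its format and style; the example content below is purely illustrative.

```python
# Hedged sketch: a prompt template with one example answer included.
ONE_SHOT_PROMPT = """Answer questions about company reports using the provided context.

Example:
Question: What was the company's total revenue in 2022?
Answer: Total revenue in 2022 was $4.2B (page 17).

Context:
{context}

Question: {question}
Answer:"""
```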
Conclusion#
Creating an effective RAG system is an iterative process of improving each component: from quality parsing and well-thought-out chunking strategy to sophisticated search methods and response generation. Each stage contributes to the final quality of the system, and optimizing the entire chain can significantly increase the accuracy and usefulness of the answers.