ReAG (Reasoning-Augmented Generation)
Author: evan
Version: 0.0.1
Overview
ReAG is a method that skips the retrieval step entirely. Instead of preprocessing documents into searchable snippets, ReAG feeds raw materials—text files, web pages, spreadsheets—directly to the language model. The model then decides what matters and why, synthesizing answers in one go.
Traditional RAG vs ReAG
Traditional RAG System
Traditional RAG systems typically operate in two phases:
- Semantic Search Phase: First uses retrieval techniques to select documents that are superficially similar to the query content
- Generation Phase: Then uses language models to generate answers from these documents
However, this two-phase approach may overlook deeper contextual relationships in documents and potentially introduce irrelevant content.
ReAG's Unified Strategy
ReAG adopts a unified approach:
- It passes raw document content directly to the language model, allowing the model to independently evaluate and integrate the complete context
- This method produces more accurate, detailed answers that better reflect complex contextual relationships
How ReAG Cuts Through the Noise
ReAG operates on a simple idea: let the language model do the heavy lifting. Instead of relying on pre-built indexes or embeddings, ReAG hands the model raw documents and asks two questions:
- Is this document useful for the task?
- What specific parts of it matter?
Example
If you ask, "Why are polar bear populations declining?" a traditional RAG system might fetch documents containing phrases like "Arctic ice melt" or "bear habitats." But ReAG goes further. It scans entire documents, considering their full context and meaning rather than just their semantic similarity to the query. A research paper titled "Thermal Dynamics of Sea Ice" might be ignored—unless the model notices a section linking ice loss to disruptions in bear feeding patterns.
This approach mirrors how humans research: we skim sources, discard irrelevant ones, and focus on passages that address our specific question. ReAG replicates this behavior programmatically, using the model's ability to infer connections rather than relying on superficial semantics.
Understanding the Difference
Traditional RAG
Operates like a librarian:
- Indexes books (documents) by summarizing their covers (embeddings)
- Uses those summaries to guess which books might answer your question
- Process is fast but reductive—prioritizes lexical proximity over functional utility
ReAG
Acts like a scholar:
- Reads every book in full
- Underlines relevant paragraphs
- Synthesizes insights based on the query's deeper intent
Technical Implementation
Key Parameters in reag.yaml
The following parameters are utilized in reag.py:
model
- Determines the language model used for answer generation
- Configured through LLMModelConfig with provider, model name, and operation mode
query
- User input string driving the answer generation process
- Combined with system prompts (e.g., REAG_SYSTEM_PROMPT) to create complete prompt messages
files
- Optional parameter for file uploads
- File content processed using MarkItDown module for text extraction
Workflow
- Tool trigger: reag.py's _invoke method receives parameters from reag.yaml (model, query, files)
- File processing: Creates temporary files and converts content to text using MarkItDown
- Parallel processing: Uses ThreadPoolExecutor for document processing
- Response structure: Returns JSON object containing:
- content: Generated answer
- reasoning: Explanation of reasoning process
- is_irrelevant: Boolean indicating relevance
- document: Original document info (name and content)
Trade-offs and Considerations
Challenges
- Cost: Processing entire documents with LLMs is more expensive than vector search
- Speed: Can struggle with massive datasets despite parallelization
Ideal Use Cases
- Complex, open-ended queries
- Dynamic data (news, research repositories)
- Multimodal data (images, tables, charts)
Future Prospects
Key Trends
-
Cheaper, faster language models
- Improvement in open-source models (Llama, DeepSeek)
- Advancement in quantization techniques
-
Larger context windows
- Expanding from millions to billions of tokens
- Enhanced document processing capabilities
-
Hybrid systems
- Combining lightweight embedding filters with ReAG
- Balancing speed and accuracy
Resources