SearchMuse Data Flow
This document describes how data flows through SearchMuse, from the initial user query to the final research result. Understanding this flow helps when debugging, extending, or optimizing the system.
High-Level Research Flow
sequenceDiagram
actor User
participant CLI
participant Orchestrator
participant LLM
participant Search
participant Scraper
participant Extractor
participant Repository
participant Renderer
User->>CLI: searchmuse "What is quantum computing?"
CLI->>Orchestrator: research(SearchQuery)
loop Iteration 1-5
Orchestrator->>LLM: generate_strategy(query, results)
LLM-->>Orchestrator: "Search for: quantum computing basics"
Orchestrator->>Search: search("quantum computing basics")
Search-->>Orchestrator: [Source, Source, Source]
par Parallel Scraping
Orchestrator->>Scraper: scrape(url1)
Orchestrator->>Scraper: scrape(url2)
Orchestrator->>Scraper: scrape(url3)
and Repository Update
Orchestrator->>Repository: save_source(source)
end
Scraper-->>Orchestrator: html_content
Orchestrator->>Extractor: extract_article(html)
Extractor-->>Orchestrator: ContentBlock
Orchestrator->>Repository: update_source(with_content)
Orchestrator->>LLM: assess_relevance(query, source)
LLM-->>Orchestrator: 0.95
break Stop condition met
Note over Orchestrator: Sufficient sources<br/>or max iterations
Orchestrator->>LLM: synthesize_result(sources, evidence)
LLM-->>Orchestrator: synthesis_text
end
end
Orchestrator->>Renderer: render(ResearchResult)
Renderer-->>Orchestrator: markdown_output
Orchestrator->>CLI: ResearchResult
CLI-->>User: Formatted output
Iteration-Level Data Flow
Each search iteration follows this cycle:
graph TD
A["SearchState<br/>iteration: N<br/>previous_results: []<br/>gathered_evidence: []"] --> B["Generate Strategy"]
B -->|LLMPort.generate_strategy| C["Strategy String<br/>Next search terms"]
C --> D["Execute Search"]
D -->|SearchPort.search| E["List[Source]<br/>title, url, summary"]
E --> F["Save Sources"]
F -->|SourceRepositoryPort.save_source| G["Sources in DB"]
G --> H["Scrape Each Source"]
H -->|ScraperPort.scrape| I["HTML Content"]
I --> J["Extract Article"]
J -->|ContentExtractorPort.extract_article| K["ContentBlock<br/>title, text, metadata"]
K --> L["Update Repository"]
L -->|SourceRepositoryPort.update_source| M["Source + Content in DB"]
M --> N["Assess Relevance"]
N -->|LLMPort.assess_relevance| O["Score: 0.0-1.0"]
O --> P{"Stop<br/>Condition?"}
P -->|No| Q["Update SearchState<br/>iteration += 1<br/>Gather evidence"]
Q --> A
P -->|Yes| R["Synthesize Result"]
R -->|LLMPort.synthesize_result| S["ResearchResult"]
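The cycle in the diagram can be sketched as a single async loop. This is a minimal illustration, not SearchMuse's actual orchestrator: the port method names come from the diagram, but their signatures, the `FakePorts` stub, the `min_evidence` stop rule, and the use of plain strings for sources and evidence are all assumptions made for the example. Repository calls are left out to keep it short.

```python
import asyncio


class FakePorts:
    """Hypothetical stand-in for the LLM, search, scraper and extractor ports."""

    def __init__(self):
        self.searches = 0

    async def generate_strategy(self, query, results):
        return f"{query} (after {len(results)} results)"

    async def search(self, strategy):
        self.searches += 1
        return [f"https://example.com/{self.searches}/{i}" for i in range(3)]

    async def scrape(self, url):
        return f"<p>content of {url}</p>"

    async def extract_article(self, html):
        # Strip the fake <p>...</p> wrapper produced by scrape() above.
        return html[len("<p>"):-len("</p>")]

    async def assess_relevance(self, query, text):
        return 0.9

    async def synthesize_result(self, query, evidence):
        return f"{len(evidence)} evidence blocks for {query!r}"


async def research(query, ports, max_iterations=5, min_evidence=5, threshold=0.6):
    iteration, results, evidence = 0, [], []
    while iteration < max_iterations:
        strategy = await ports.generate_strategy(query, results)
        urls = await ports.search(strategy)
        # Scrape the new sources concurrently, then extract and score each page.
        pages = await asyncio.gather(*(ports.scrape(url) for url in urls))
        texts = [await ports.extract_article(page) for page in pages]
        scores = [await ports.assess_relevance(query, text) for text in texts]
        results += urls
        evidence += [t for t, s in zip(texts, scores) if s >= threshold]
        iteration += 1
        if len(evidence) >= min_evidence:  # assumed stop condition
            break
    synthesis = await ports.synthesize_result(query, evidence)
    return iteration, evidence, synthesis
```

With the stub ports, `asyncio.run(research("quantum computing", FakePorts()))` stops after the second pass, because six evidence blocks exceed the assumed minimum of five.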
Data Transformations
Stage 1: Query to Strategy
Input: SearchQuery
SearchQuery(
    text="What is quantum computing?",
    max_iterations=5,
    timeout_seconds=300,
    language="en"
)
Processing:
1. Validate query length and constraints
2. Pass to LLM with context from previous iteration
3. LLM generates refined search terms
Output: String
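The validation and prompt-building steps might look like the sketch below. The field names match the `SearchQuery` example above; the concrete limits (500 characters, positive timeout) and the `build_strategy_prompt` helper are hypothetical, chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchQuery:
    text: str
    max_iterations: int = 5
    timeout_seconds: int = 300
    language: str = "en"

    def __post_init__(self):
        # Limits below are illustrative assumptions, not SearchMuse's real constraints.
        if not self.text.strip():
            raise ValueError("query text must not be empty")
        if len(self.text) > 500:
            raise ValueError("query text must be at most 500 characters")
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")


def build_strategy_prompt(query, previous_terms):
    """Combine the query with earlier search terms so the LLM can refine them."""
    tried = ", ".join(previous_terms) if previous_terms else "none yet"
    return (
        f"Research query: {query.text}\n"
        f"Search terms already tried: {tried}\n"
        "Suggest the next search terms:"
    )
```

Validating in `__post_init__` means an invalid query fails at construction time, before any LLM call is made.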
Stage 2: Strategy to Sources
Input: Strategy string
Processing:
1. Parse strategy terms
2. Execute search via SearchPort
3. Deduplicate against previously discovered sources
4. Return ranked source list
Output: List[Source]
[
    Source(
        url="https://quantum.ibm.com/",
        title="IBM Quantum",
        summary="IBM's quantum computing platform...",
        relevance_score=0.87,
        discovered_at=datetime.now()
    ),
    # More sources...
]
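Deduplication and ranking (steps 3 and 4) could be implemented along these lines. The source does not say how SearchMuse compares URLs, so the normalization rule here (lowercase scheme and host, ignore trailing slash and fragment) is an assumption, as are the helper names.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Source:
    url: str
    title: str
    relevance_score: float = 0.0


def normalize_url(url):
    """Assumed comparison key: scheme and host lowercased, no trailing slash or fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def deduplicate(new_sources, seen_urls):
    """Drop sources already seen, then rank the rest by relevance (highest first)."""
    fresh = {}
    for source in new_sources:
        key = normalize_url(source.url)
        if key in seen_urls:
            continue
        best = fresh.get(key)
        # Keep the higher-scored entry when one batch contains the same page twice.
        if best is None or source.relevance_score > best.relevance_score:
            fresh[key] = source
    return sorted(fresh.values(), key=lambda s: s.relevance_score, reverse=True)
```

`seen_urls` is expected to hold normalized keys from earlier iterations, so the caller would add `normalize_url(s.url)` for every source it keeps.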
Stage 3: Source to Content
Input: Source from Stage 2 (no extracted content yet)
Processing:
1. Fetch HTML via ScraperPort (httpx or playwright)
2. Extract article via ContentExtractorPort (trafilatura)
3. Update source with content and metadata
Output: Source with extracted_content
Source(
    url="https://quantum.ibm.com/",
    title="IBM Quantum",
    summary="...",
    relevance_score=0.87,
    discovered_at=datetime.now(),
    extracted_content=ContentBlock(
        text="Quantum computing leverages quantum mechanics...",
        source_url="https://quantum.ibm.com/",
        blocks=[
            ContentBlock(text="Paragraph 1..."),
            ContentBlock(text="Paragraph 2..."),
        ]
    )
)
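One way to combine the scrape and extract steps is sketched below. The source only says that scraping uses httpx or playwright; trying scrapers in order and falling back when nothing can be extracted is an assumption, and `fetch_content`, the callable-based scrapers, and the simplified data classes are hypothetical stand-ins for the real ports.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ContentBlock:
    text: str
    source_url: str


@dataclass(frozen=True)
class Source:
    url: str
    title: str
    extracted_content: Optional[ContentBlock] = None


async def fetch_content(source, scrapers, extract):
    """Try each scraper in order until one yields extractable text."""
    for scrape in scrapers:
        html = await scrape(source.url)
        text = extract(html)
        if text:
            block = ContentBlock(text=text, source_url=source.url)
            # replace() returns a new Source; the original stays unchanged.
            return replace(source, extracted_content=block)
    return source  # nothing extractable, so extracted_content stays None
```

In a real adapter, `scrapers` might hold a plain HTTP fetcher first and a browser-based one second, so the slower path runs only for pages that need it.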
Stage 4: Content to Relevance Score
Input: SearchQuery + Source
Processing:
1. Format LLM prompt with query, source title, and source summary
2. LLM scores relevance (0.0 = irrelevant, 1.0 = perfect match)
3. Filter low-relevance sources (threshold: 0.6)
Output: Float (0.0-1.0)
Example Prompt:
Research query: "What is quantum computing?"
Source title: "IBM Quantum"
Source summary: "IBM's quantum computing platform and services"
Rate relevance from 0.0 to 1.0:
Example Response: 0.95
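The prompt formatting and score handling can be sketched as follows. The prompt wording and the 0.6 threshold come from the text above; the parsing rules (take the first number in the reply, clamp to the valid range, treat unparseable replies as 0.0, treat a score equal to the threshold as relevant) are assumptions, since the source does not say how replies are parsed.

```python
import re

RELEVANCE_THRESHOLD = 0.6  # threshold stated in the processing steps


def build_relevance_prompt(query, title, summary):
    return (
        f'Research query: "{query}"\n'
        f'Source title: "{title}"\n'
        f'Source summary: "{summary}"\n'
        "Rate relevance from 0.0 to 1.0:"
    )


def parse_score(reply):
    """Pull the first number out of the reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return 0.0  # assumption: unparseable replies count as irrelevant
    return min(1.0, max(0.0, float(match.group())))


def is_relevant(score, threshold=RELEVANCE_THRESHOLD):
    # Assumption: a score exactly at the threshold is kept.
    return score >= threshold
```

Clamping guards against a model that answers on a different scale, such as "7", at the cost of silently treating that reply as a perfect match.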
Stage 5: Evidence Gathering
Input: List[Source] with relevance scores
Processing:
1. Filter sources by relevance threshold
2. Extract content blocks from top sources
3. Chunk content into smaller blocks if needed
4. Organize chronologically and by relevance
Output: List[ContentBlock]
[
    ContentBlock(
        text="Quantum computing is a type of...",
        source_url="https://quantum.ibm.com/",
        metadata={"source": "IBM Quantum", "relevance": 0.95}
    ),
    # More evidence blocks...
]
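A minimal sketch of filtering, chunking, and ordering is shown below. The `ScoredSource` type, the word-boundary chunking rule, and the 200-character default are assumptions for illustration; only ordering by relevance is shown, since the source does not define the chronological key.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ScoredSource:  # hypothetical flattened view of a scored Source
    url: str
    title: str
    text: str
    relevance: float


@dataclass(frozen=True)
class ContentBlock:
    text: str
    source_url: str
    metadata: dict = field(default_factory=dict)


def chunk_text(text, max_chars=200):
    """Greedily pack whole words into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def gather_evidence(sources, threshold=0.6, max_chars=200):
    kept = sorted(
        (s for s in sources if s.relevance >= threshold),
        key=lambda s: s.relevance,
        reverse=True,
    )
    return [
        ContentBlock(chunk, s.url, {"source": s.title, "relevance": s.relevance})
        for s in kept
        for chunk in chunk_text(s.text, max_chars)
    ]
```

A single word longer than `max_chars` is kept whole rather than split, which keeps the sketch simple but means the limit is soft.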
Stage 6: Synthesis
Input:
- SearchQuery
- List[Source] (curated)
- List[ContentBlock] (evidence)
Processing:
1. Format LLM prompt with query and evidence
2. LLM generates comprehensive synthesis
3. Include citations in synthesis text
Output: String (synthesis)
Example Prompt:
Based on the following research evidence, provide a comprehensive
answer to the query.
Query: "What is quantum computing?"
Evidence:
1. [Source 1]: Quantum computing is...
2. [Source 2]: Key principles include...
Provide a well-sourced, detailed answer.
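Assembling that prompt is mostly string formatting. The sketch below reproduces the wording of the example prompt; the function name and the `(title, text)` pair representation of evidence are assumptions.

```python
def build_synthesis_prompt(query, evidence):
    """Build the synthesis prompt from a query and (source_title, text) pairs."""
    lines = [
        "Based on the following research evidence, provide a comprehensive",
        "answer to the query.",
        f'Query: "{query}"',
        "Evidence:",
    ]
    # Number the evidence from 1 so the LLM can cite entries by index.
    for number, (title, text) in enumerate(evidence, start=1):
        lines.append(f"{number}. [{title}]: {text}")
    lines.append("Provide a well-sourced, detailed answer.")
    return "\n".join(lines)
```

Keeping the numbering stable between the prompt and the rendered source list is what lets citations in the synthesis text be traced back to sources.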
Stage 7: Result Rendering
Input: ResearchResult
Processing:
1. Format synthesis with rich markdown
2. Create numbered source list with URLs
3. Generate bibliography with proper citations
4. Include evidence blocks as appendix
5. Add execution metrics
Output: String (markdown)
# Research Result: What is quantum computing?
## Summary
Quantum computing is a paradigm shift in computation...
## Sources (12 found)
1. [IBM Quantum](https://quantum.ibm.com/) - Relevance: 0.95
...
## Full Citations
[1] IBM Quantum. (2024). Introduction to Quantum Computing...
...
## Evidence
[Evidence blocks with source citations]
---
Completed in 45.2 seconds | 5 iterations | 12 sources analyzed
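A reduced renderer that produces the summary, source list, and metrics footer could look like this. It is a sketch only: the function signature and the `(title, url, relevance)` tuples are assumptions, and the bibliography and evidence appendix are omitted.

```python
def render_markdown(query, summary, sources, elapsed_seconds, iterations):
    """Render a result as markdown; sources are (title, url, relevance) tuples."""
    lines = [
        f"# Research Result: {query}",
        "## Summary",
        summary,
        f"## Sources ({len(sources)} found)",
    ]
    for number, (title, url, relevance) in enumerate(sources, start=1):
        lines.append(f"{number}. [{title}]({url}) - Relevance: {relevance:.2f}")
    lines.append("---")
    lines.append(
        f"Completed in {elapsed_seconds:.1f} seconds | "
        f"{iterations} iterations | {len(sources)} sources analyzed"
    )
    return "\n".join(lines)
```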
Error Handling Flow
graph TD
A["Operation<br/>scrape/extract/search"] --> B{"Error<br/>Occurs?"}
B -->|No| C["Continue<br/>Normal Flow"]
B -->|Yes| D{"Retryable?"}
D -->|Timeout| E["Retry with<br/>backoff"]
D -->|Network| E
D -->|Rate Limit| E
D -->|Invalid Input| F["Log error<br/>Skip source"]
D -->|Max iterations| G["Attempt synthesis<br/>with current evidence"]
E --> H{"Retries<br/>Exhausted?"}
H -->|No| A
H -->|Yes| F
F --> I["Update SearchState<br/>with degraded result"]
G --> J["Return partial result"]
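The retry branch of this flow can be sketched with exponential backoff. The exception mapping is an assumption: built-in `TimeoutError` and `ConnectionError` stand in for timeout and network failures, `RateLimitError` is a hypothetical class, and `ValueError` represents invalid input. The delays and retry count are illustrative, and the max-iterations branch is not covered here.

```python
import asyncio
import logging

logger = logging.getLogger("searchmuse.sketch")


class RateLimitError(Exception):
    """Hypothetical error an adapter might raise when it is rate limited."""


RETRYABLE = (TimeoutError, asyncio.TimeoutError, ConnectionError, RateLimitError)


async def with_retries(operation, max_retries=3, base_delay=0.5, sleep=asyncio.sleep):
    """Run `operation`; return its result, or None if the source should be skipped."""
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except RETRYABLE as error:
            if attempt == max_retries:
                logger.warning("retries exhausted, skipping: %r", error)
                return None
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            await sleep(base_delay * 2 ** attempt)
        except ValueError as error:  # invalid input is not worth retrying
            logger.warning("invalid input, skipping: %r", error)
            return None
```

The `sleep` parameter exists so tests can inject a stub and avoid real waiting; production code would leave it at `asyncio.sleep`.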
Performance Characteristics
Memory Usage
- Per iteration: ~5-10 MB (depends on extracted content size)
- Total: ~50-100 MB for 5 iterations × 10 sources
- Mitigated by streaming large content blocks
Network Bandwidth
- Per iteration: ~1-5 MB (10 sources × 100 KB average)
- Scraping: ~50% of bandwidth
- LLM API calls: ~5% (small prompts)
Execution Time
- Per iteration: 30-60 seconds
  - LLM strategy generation: 2-5s
  - Search execution: 3-5s
  - Scraping (parallel): 15-30s
  - Content extraction: 5-10s
  - Relevance assessment: 3-5s
Concurrency Model
SearchMuse uses async/await for concurrent operations:
import asyncio

# Hypothetical wrapper: `await` is only valid inside a coroutine
async def scrape_and_score(scraper, llm, query, sources):
    # Parallel scraping of multiple sources; results keep the input order
    tasks = [scraper.scrape(s.url) for s in sources]
    html_contents = await asyncio.gather(*tasks)

    # Parallel relevance assessment
    relevance_tasks = [
        llm.assess_relevance(query, s) for s in sources
    ]
    scores = await asyncio.gather(*relevance_tasks)
    return html_contents, scores
Benefits:
- 3-5x faster than sequential execution
- Efficient I/O utilization
- Responsive CLI (no blocking)
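A plain `asyncio.gather` raises as soon as one task fails and puts no cap on simultaneous requests. The sketch below shows one common way to handle both with a semaphore and `return_exceptions=True`. The source does not say whether SearchMuse does this; `scrape_all` and the `max_concurrent` default are assumptions.

```python
import asyncio


async def scrape_all(urls, scrape, max_concurrent=5):
    """Scrape URLs concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with semaphore:
            return await scrape(url)

    # return_exceptions=True keeps one failed page from cancelling the rest.
    results = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )
    pages = {u: r for u, r in zip(urls, results) if not isinstance(r, BaseException)}
    failed = {u: r for u, r in zip(urls, results) if isinstance(r, BaseException)}
    return pages, failed
```

Returning failures separately lets the caller decide whether to retry, skip, or log each one.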
State Persistence
SearchState is immutable and versioned:
# Iteration 0
state_v0 = SearchState(
    query=query,
    iteration=0,
    previous_results=[],
    gathered_evidence=[]
)
# Iteration 1
state_v1 = SearchState(
    query=query,
    iteration=1,
    previous_results=sources_from_iteration_0,
    gathered_evidence=evidence_from_iteration_0
)
This enables:
- Easy rollback
- Historical tracking
- Replay capability
- Debugging
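Immutability of this kind is commonly implemented with a frozen dataclass, where each iteration produces a new object instead of mutating the old one. The sketch below assumes that approach; the `advance` method, the use of tuples instead of lists, and the `history` list are illustrative choices, not confirmed details of SearchMuse.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SearchState:
    query: str
    iteration: int = 0
    previous_results: tuple = ()
    gathered_evidence: tuple = ()

    def advance(self, new_results, new_evidence):
        """Return the next version of the state; this instance is left untouched."""
        return replace(
            self,
            iteration=self.iteration + 1,
            previous_results=self.previous_results + tuple(new_results),
            gathered_evidence=self.gathered_evidence + tuple(new_evidence),
        )


# Keeping every version makes rollback a matter of picking an earlier entry.
history = [SearchState(query="What is quantum computing?")]
history.append(
    history[-1].advance(["https://quantum.ibm.com/"], ["Quantum computing is..."])
)
```

Tuples are used because a frozen dataclass only blocks attribute reassignment; a list field could still be mutated in place.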
Related Documentation
- Components Guide - Component interfaces
- Architecture Overview - Layer organization
- API Reference - Data class definitions
Last updated: 2026-02-28