SearchMuse Feature Specifications
Feature Architecture Overview
graph TB
A[User Query] --> B[Query Parser]
B --> C[LLM Strategy Generator]
C --> D[Search Engine]
D --> E[Web Scraper]
E --> F[Content Extractor]
F --> G[LLM Relevance Assessor]
G --> H[Coverage Assessor]
H --> I{Converged?}
I -->|No| C
I -->|Yes| J[Result Synthesizer]
J --> K[Citation Formatter]
K --> L[Final Output]
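Feature 1: Iterative Search
Overview
SearchMuse runs a search refinement loop without user intervention: after each round of searching, an LLM assesses how well the retrieved sources cover the query and, if gaps remain, generates a refined search strategy. The loop ends when coverage is sufficient or the iteration limit is reached.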
Workflow
1. Query Normalization: Parse user input and extract key terms and intent
2. Strategy Generation: The LLM analyzes the query and generates a search strategy
   - Primary search terms and synonyms
   - Domain suggestions (academic, technical, news, etc.)
   - Search order and priority
3. Execute Search: Query DuckDuckGo with the generated terms
4. Content Extraction: Scrape the top 10-20 results
5. Relevance Assessment: The LLM evaluates each source for relevance to the original query
6. Coverage Assessment: The LLM determines whether the sources adequately address the query
7. Convergence Check: If coverage is sufficient, move to synthesis
8. Strategy Refinement: If not converged, the LLM performs gap analysis and generates a refined strategy
9. Repeat: Execute steps 3-8 until convergence or max_iterations is reached
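The workflow above can be sketched as a small control loop. This is a minimal illustration, not SearchMuse's actual implementation: `IterationOutcome`, `run_search_loop`, and the `run_iteration` callback are hypothetical names, and treating `min_sources` as part of the convergence condition is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IterationOutcome:
    """Hypothetical result of one search/assess round."""
    sources: list[str]                      # URLs retrieved this round
    coverage: float                         # coverage score, 0.0-1.0
    gaps: list[str] = field(default_factory=list)


def run_search_loop(
    query: str,
    run_iteration: Callable[[str, list[str]], IterationOutcome],
    max_iterations: int = 5,
    coverage_threshold: float = 0.7,
    min_sources: int = 5,
) -> tuple[list[str], int, bool]:
    """Repeat rounds until coverage converges or max_iterations is reached.

    Returns (unique sources in discovery order, rounds used, converged flag).
    """
    all_sources: list[str] = []
    gaps: list[str] = []
    for iteration in range(1, max_iterations + 1):
        # The gaps found in the previous round steer the refined strategy.
        outcome = run_iteration(query, gaps)
        for src in outcome.sources:
            if src not in all_sources:
                all_sources.append(src)
        # Assumption: convergence needs both enough coverage and enough sources.
        if outcome.coverage >= coverage_threshold and len(all_sources) >= min_sources:
            return all_sources, iteration, True
        gaps = outcome.gaps
    return all_sources, max_iterations, False
```

Passing the per-round work in as a callback keeps the loop itself free of search, scraping, and LLM details, which makes the stopping rules easy to test in isolation.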
Configuration
search:
  max_iterations: 5
  min_sources: 5
  coverage_threshold: 0.7   # 0.0-1.0
  results_per_query: 15
  timeout_per_source: 10s   # seconds
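One way to validate these settings after the YAML has been loaded into a dictionary is sketched below. `SearchConfig`, `parse_seconds`, and `from_mapping` are hypothetical names, and the sketch assumes durations are always written with an `s` suffix, as in the example above.

```python
from dataclasses import dataclass


def parse_seconds(value: str) -> float:
    """Parse a duration such as '10s' into seconds (only the 's' suffix is handled)."""
    text = value.strip()
    if not text.endswith("s"):
        raise ValueError(f"expected a duration like '10s', got {value!r}")
    return float(text[:-1])


@dataclass(frozen=True)
class SearchConfig:
    max_iterations: int = 5
    min_sources: int = 5
    coverage_threshold: float = 0.7
    results_per_query: int = 15
    timeout_per_source: float = 10.0  # seconds

    def __post_init__(self) -> None:
        # Reject values that would make the search loop meaningless.
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")
        if not 0.0 <= self.coverage_threshold <= 1.0:
            raise ValueError("coverage_threshold must be within 0.0-1.0")

    @classmethod
    def from_mapping(cls, raw: dict) -> "SearchConfig":
        """Build a config from the `search:` section, converting '10s' to 10.0."""
        values = dict(raw)
        timeout = values.get("timeout_per_source")
        if isinstance(timeout, str):
            values["timeout_per_source"] = parse_seconds(timeout)
        return cls(**values)
```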
Outputs per Iteration
- List of sources retrieved
- Relevance scores (0.0-1.0)
- Coverage assessment
- Identified gaps for next iteration
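These per-iteration outputs could be carried in a single record. The sketch below is illustrative only: `IterationReport` and its field names are hypothetical, and storing relevance as a URL-to-score mapping is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IterationReport:
    """Hypothetical container for the outputs of one iteration."""
    iteration: int
    relevance: dict[str, float]     # source URL -> relevance score (0.0-1.0)
    coverage: float                 # coverage assessment (0.0-1.0)
    gaps: tuple[str, ...] = ()      # gaps to address in the next iteration

    def __post_init__(self) -> None:
        # All scores share the same 0.0-1.0 range.
        scores = [*self.relevance.values(), self.coverage]
        if any(not 0.0 <= score <= 1.0 for score in scores):
            raise ValueError("scores must lie within 0.0-1.0")

    @property
    def sources(self) -> list[str]:
        """Retrieved sources, most relevant first."""
        return sorted(self.relevance, key=self.relevance.get, reverse=True)
```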
Feature 2: Source Citation
Philosophy
Every claim in SearchMuse output is traceable to its source. Citations are integral to the system, not an afterthought.
Citation Data Model
from dataclasses import dataclass

# Requires Python 3.10+ for the `str | None` annotation syntax
@dataclass
class Citation:
    index: int                     # Reference number [1], [2], etc.
    source_id: str                 # Unique identifier
    url: str                       # Full URL
    title: str                     # Page title
    author: str | None             # Author if available
    publication_date: str | None   # ISO 8601 format
    access_date: str               # When SearchMuse accessed it
    excerpt: str | None            # Optional relevant quote
Citation Formats
Markdown (Default)
This is a claim[1] supported by evidence[2].
## References
[1] "Page Title", Author Name, https://example.com/page
[2] "Another Page", https://example.org/article
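A formatter that produces reference lines in the Markdown shape shown above might look like the following. It is a sketch under assumptions: the trimmed-down `Citation` here carries only the fields the format needs, and `format_reference` and `format_references` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Reduced citation record with only the fields used by this format."""
    index: int
    url: str
    title: str
    author: Optional[str] = None


def format_reference(citation: Citation) -> str:
    """Render one reference line, omitting the author when it is unknown."""
    parts = [f'"{citation.title}"']
    if citation.author:
        parts.append(citation.author)
    parts.append(citation.url)
    return f"[{citation.index}] " + ", ".join(parts)


def format_references(citations: list[Citation]) -> str:
    """Render the full reference section, ordered by citation index."""
    ordered = sorted(citations, key=lambda c: c.index)
    return "\n".join(["## References", *(format_reference(c) for c in ordered)])
```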
HTML
<p>This is a claim<sup><a href="#ref1">[1]</a></sup>.</p>
<ol id="references">
<li id="ref1"><a href="https://example.com">Page Title</a>, Author</li>
</ol>
APA-Style
This is a claim (Author, 2024).
References
Author, A. A. (2024). Page title. Retrieved from https://example.com
Citation Extraction Process
- As content is extracted from each source, citation metadata is captured
- LLM identifies specific claims from source content
- Each claim mapped to citation index
- Citation list compiled with full metadata
- Output formatter applies selected citation style
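The claim-to-index mapping step can be illustrated with a short sketch. It assumes claims arrive as (text, source_id) pairs in output order and that indices are assigned in order of first use; `assign_citation_indices` is a hypothetical name.

```python
def assign_citation_indices(
    claims: list[tuple[str, str]],
) -> tuple[list[str], dict[str, int]]:
    """Attach [n] markers to claims, reusing an index when a source repeats.

    Returns (annotated claim texts, mapping of source_id -> citation index).
    """
    index_by_source: dict[str, int] = {}
    annotated: list[str] = []
    for text, source_id in claims:
        # A new source gets the next free number; a known source keeps its own.
        index = index_by_source.setdefault(source_id, len(index_by_source) + 1)
        annotated.append(f"{text}[{index}]")
    return annotated, index_by_source
```

Numbering by first use keeps the reference list in the same order in which a reader encounters the markers.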
Feature 3: Content Extraction
Primary Strategy: Trafilatura
- Extracts main content from HTML
- Removes boilerplate, ads, navigation
- Preserves text structure
- Fast and lightweight
Fallback Strategy: Readability-lxml
- Alternative extraction engine for sites trafilatura struggles with
- Uses browser-like content classification
- Slower but more reliable on complex layouts
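Extraction Pipeline
import trafilatura

MIN_WORDS = 100  # minimum content length in words (configurable)

# ExtractedContent, readability_extract, extract_title, extract_author and
# extract_date are project helpers defined elsewhere.
def extract_content(html: str) -> ExtractedContent:
    # Attempt trafilatura extraction (returns None when nothing is found)
    content = trafilatura.extract(html)
    # Compare the word count, not the character count, against MIN_WORDS
    if not content or len(content.split()) < MIN_WORDS:
        # Fallback to readability
        content = readability_extract(html)
    return ExtractedContent(
        main_text=content,
        title=extract_title(html),
        author=extract_author(html),
        publish_date=extract_date(html),
    )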
Quality Metrics
- Minimum content length: 100 words (configurable)
- Text-to-HTML ratio: >0.15 (not mostly markup)
- Encoding detection: UTF-8 or auto-detected
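The first two metrics can be checked with a few lines of code. This sketch assumes the text-to-HTML ratio is measured as extracted characters divided by raw HTML characters, which the specification does not state; `passes_quality_checks` is a hypothetical name.

```python
def passes_quality_checks(
    main_text: str,
    html: str,
    min_words: int = 100,
    min_ratio: float = 0.15,
) -> bool:
    """Return True when extracted text is long enough and not mostly markup."""
    if not html:
        return False
    word_count = len(main_text.split())
    # Assumption: ratio = characters of extracted text / characters of raw HTML.
    ratio = len(main_text) / len(html)
    return word_count >= min_words and ratio > min_ratio
```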
Feature 4: LLM Integration via Ollama
Model Options
- mistral (default): Balanced speed/quality, ~7B parameters
- llama3: Better reasoning, ~8B parameters
- phi3: Smaller/faster, ~3.8B parameters
LLM Tasks
Strategy Generation
- Input: User query, previous search results (if iterating)
- Output: List of search terms, domain preferences, search order
- Temperature: 0.7 (creative but focused)
Relevance Assessment
- Input: Query, source content
- Output: Relevance score 0.0-1.0, brief justification
- Temperature: 0.3 (low, for consistent scoring)
Coverage Assessment
- Input: Query, all retrieved sources and their content
- Output: Coverage score 0.0-1.0, identified gaps
- Temperature: 0.3 (low, for consistent scoring)
Result Synthesis
- Input: Query, all sources, relevance scores
- Output: Coherent answer with inline citations
- Temperature: 0.5 (balanced)
Configuration
llm:
  provider: ollama
  model: mistral
  base_url: http://localhost:11434
  timeout: 60s
  temperature:
    strategy: 0.7
    relevance: 0.3
    coverage: 0.3
    synthesis: 0.5
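As an illustration of how the per-task temperatures might be applied, the sketch below builds a non-streaming request for Ollama's `/api/generate` endpoint, passing the temperature through the request's `options` field. The function names and the `TEMPERATURES` table are hypothetical; only the standard library is used.

```python
import json
import urllib.request

# Per-task temperatures, mirroring the configuration above.
TEMPERATURES = {"strategy": 0.7, "relevance": 0.3, "coverage": 0.3, "synthesis": 0.5}


def build_generate_request(
    task: str,
    prompt: str,
    model: str = "mistral",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate with the task's temperature."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {"temperature": TEMPERATURES[task]},
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(task: str, prompt: str, timeout: float = 60.0, **kwargs: str) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    request = build_generate_request(task, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Separating request construction from sending means the temperature mapping can be checked without a running Ollama instance.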
Feature 5: Multi-Strategy Scraping
HTTP Scraping (httpx)
- Used for static HTML sites
- Fast and resource-efficient
- Respects robots.txt
Dynamic Scraping (Playwright)
- Used for JavaScript-heavy sites
- Waits for content rendering
- More resource-intensive
Strategy Selection
def select_scraper(url: str) -> ScraperType:
    # Check robots.txt first
    if blocked_by_robots_txt(url):
        return ScraperType.BLOCKED
    # Heuristics: common JS frameworks indicate need for Playwright
    if likely_js_heavy(url):
        return ScraperType.PLAYWRIGHT
    return ScraperType.HTTPX
robots.txt Compliance
- Check robots.txt before scraping
- Respect Disallow rules for user-agent "searchmuse"
- Rate limiting: at least 1 second between requests to the same domain
- User-Agent string identifies SearchMuse
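These rules could be enforced with the standard library's `urllib.robotparser` plus a small per-domain limiter. The sketch below is one possible approach, not SearchMuse's actual code: `is_allowed` and `DomainRateLimiter` are hypothetical names, and the robots.txt text is assumed to have been fetched already.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "searchmuse"
MIN_DELAY_SECONDS = 1.0


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainRateLimiter:
    """Keeps at least min_delay seconds between requests to the same domain."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic, sleep=time.sleep):
        self._min_delay = min_delay
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep if needed before requesting url; return the delay applied."""
        domain = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(domain)
        delay = 0.0 if last is None else max(0.0, self._min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually sent (after any sleep).
        self._last_request[domain] = now + delay
        return delay
```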
Feature 6: Result Synthesis
Synthesis Process
- LLM receives all retrieved sources and their content
- LLM generates coherent answer addressing original query
- Inline citations added as references are made
- Citation list compiled
- Output formatted in selected style
Quality Assurance
- Verify that every citation referenced in the text exists in the reference list
- Check for hallucinated sources (LLM-invented references)
- Validate that source URLs are functional
- Ensure comprehensive coverage of query intent
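The first check can be sketched as a comparison between the `[n]` markers in the body and the indices in the reference list. `check_citations` is a hypothetical name, and the sketch assumes markers always take the plain `[n]` form used in the Markdown format.

```python
import re

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def check_citations(body: str, reference_indices: set[int]) -> tuple[set[int], set[int]]:
    """Compare inline [n] markers with the reference list.

    Returns (indices cited but missing from the references,
             indices listed in the references but never cited).
    """
    cited = {int(number) for number in CITATION_PATTERN.findall(body)}
    return cited - reference_indices, reference_indices - cited
```

A non-empty first set signals a marker with no matching reference, which is one symptom of an LLM-invented source.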
Example Output
# Research Results: Zero-Knowledge Proofs
Zero-knowledge proofs (ZKPs) are cryptographic protocols that allow one party
to prove they know a fact without revealing the fact itself[1]. This technique
has applications in blockchain, privacy-preserving authentication, and more[2].
## Key Applications
ZKPs are increasingly used in blockchain systems for transaction privacy[3] and
in authentication systems for password-free login[4].
## References
[1] "Zero-Knowledge Proof", Wikipedia, https://en.wikipedia.org/wiki/Zero-knowledge_proof
[2] "Understanding Zero-Knowledge Proofs", Author Name, https://example.com/zk-guide
[3] "Privacy in Blockchain", Journal of Cryptography, https://example.com/zk-blockchain
[4] "Zero-Knowledge Authentication", Security Today, https://example.com/auth-zk