SearchMuse Components¶

This document describes the major components of SearchMuse, along with their responsibilities and interfaces. Understanding them is essential for extending the system and contributing to the project.

Domain Components¶

SearchQuery¶

Represents a user research request.

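from dataclasses import dataclass, field  # shared by the domain classes below
from datetime import datetime

@dataclass(frozen=True)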
class SearchQuery:
    text: str                      # Research question
    max_iterations: int = 5        # Max refinement cycles
    timeout_seconds: int = 300     # Total timeout
    language: str = "en"           # Result language

Responsibilities:
- Encapsulate user research intent
- Validate input constraints (length, timeout)
- Immutable by design
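
The dataclass above does not show where input constraints are checked. One possible approach, sketched here under assumptions, is to validate in `__post_init__`, which still runs on frozen dataclasses. The class name `ValidatedSearchQuery`, the `MAX_TEXT_LENGTH` limit, and the error messages are hypothetical and not taken from SearchMuse.

```python
from dataclasses import dataclass

# Hypothetical limit, chosen only for illustration; not taken from SearchMuse.
MAX_TEXT_LENGTH = 1000


@dataclass(frozen=True)
class ValidatedSearchQuery:
    text: str
    max_iterations: int = 5
    timeout_seconds: int = 300
    language: str = "en"

    def __post_init__(self) -> None:
        # Frozen dataclasses still run __post_init__, so checks can live here.
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if len(self.text) > MAX_TEXT_LENGTH:
            raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
```

Because the checks only read fields and never assign them, they do not conflict with `frozen=True`.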

SearchState¶

Tracks research progress across iterations.

@dataclass(frozen=True)
class SearchState:
    query: SearchQuery
    iteration: int                 # Current iteration (0-based)
    previous_results: list[Source]
    gathered_evidence: list[ContentBlock]
    current_strategy: str          # Search strategy
    is_complete: bool = False

Responsibilities:
- Maintain research context between iterations
- Track evidence gathered so far
- Determine when to stop searching
- Support iteration logic
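
Since the state is frozen, each iteration has to produce a new instance rather than mutate the old one. The following is a minimal sketch of that pattern using `dataclasses.replace`. The reduced `State` class, the `advance` helper, and the stop rule are assumptions made for illustration; sources are represented as plain URL strings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    """Reduced stand-in for SearchState; sources are plain URL strings here."""
    iteration: int = 0
    previous_results: tuple[str, ...] = ()
    current_strategy: str = ""
    is_complete: bool = False


def advance(state: State, new_results: list[str], strategy: str,
            max_iterations: int) -> State:
    """Return the state for the next iteration without mutating the old one."""
    next_iteration = state.iteration + 1
    return replace(
        state,
        iteration=next_iteration,
        previous_results=state.previous_results + tuple(new_results),
        current_strategy=strategy,
        # Hypothetical stop rule: finish once the iteration budget is used up.
        is_complete=next_iteration >= max_iterations,
    )
```

The sketch uses a tuple for accumulated results so that earlier states cannot be altered through a shared list.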

Source¶

Represents a discovered web source.

@dataclass(frozen=True)
class Source:
    url: str
    title: str
    summary: str
    relevance_score: float         # 0.0-1.0
    discovered_at: datetime
    extracted_content: ContentBlock | None = None

Responsibilities:
- Record source metadata and location
- Store relevance assessment
- Link to extracted content
- Support citation generation

Citation¶

Represents a formal reference to a source.

@dataclass(frozen=True)
class Citation:
    source: Source
    page_number: int | None = None
    accessed_at: datetime = field(default_factory=datetime.now)
    quote: str | None = None       # Relevant quote

Responsibilities:
- Format citations (APA, Chicago, MLA)
- Track access date for verification
- Link to specific quotes
- Support bibliography generation
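
As a rough illustration of citation formatting, the sketch below builds a simplified, APA-inspired reference for an undated web page from a title, URL, and access date. The function name `format_apa_like` and the exact output layout are assumptions; a real formatter would need author and publication-date handling that the fields shown above do not cover.

```python
from datetime import datetime


def format_apa_like(title: str, url: str, accessed_at: datetime,
                    quote: str = "") -> str:
    """Build a simplified, APA-inspired reference for an undated web page."""
    # Format the day without a leading zero, e.g. "February 8, 2026".
    retrieved = f"{accessed_at.strftime('%B')} {accessed_at.day}, {accessed_at.year}"
    reference = f"{title}. (n.d.). Retrieved {retrieved}, from {url}"
    if quote:
        # Append the quoted passage, if one was recorded for this citation.
        reference += f' ("{quote}")'
    return reference
```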

ContentBlock¶

Represents extracted content from a source.

@dataclass(frozen=True)
class ContentBlock:
    text: str
    source_url: str
    blocks: list['ContentBlock'] = field(default_factory=list)  # Paragraphs
    metadata: dict[str, str] = field(default_factory=dict)

Responsibilities:
- Store extracted article content
- Maintain source reference
- Preserve content hierarchy
- Support content searching
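
The nested `blocks` field makes a ContentBlock a small tree. A minimal sketch of how that hierarchy could be traversed and searched is shown below. The `walk` and `find_blocks` helpers are hypothetical and not part of the documented API.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass(frozen=True)
class ContentBlock:
    text: str
    source_url: str
    blocks: list["ContentBlock"] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


def walk(block: ContentBlock) -> Iterator[ContentBlock]:
    """Yield the block and all nested blocks in depth-first order."""
    yield block
    for child in block.blocks:
        yield from walk(child)


def find_blocks(root: ContentBlock, term: str) -> list[ContentBlock]:
    """Hypothetical helper: case-insensitive substring search over the tree."""
    needle = term.lower()
    return [b for b in walk(root) if needle in b.text.lower()]
```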

ResearchResult¶

Final output of a research session.

@dataclass(frozen=True)
class ResearchResult:
    query: SearchQuery
    synthesis: str                 # AI-generated summary
    sources: list[Source]
    citations: list[Citation]
    evidence_blocks: list[ContentBlock]
    execution_time_seconds: float
    total_iterations: int

Responsibilities:
- Aggregate research outcomes
- Organize sources and citations
- Include execution metrics
- Support export and rendering

Port Interfaces¶

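Ports are Python Protocol classes (typing.Protocol) that define the contracts between the core and external services. Adapters satisfy a port by implementing its methods; they do not need to inherit from it.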

LLMPort¶

Strategy generation and result synthesis.

from typing import Protocol  # shared by the port definitions below

class LLMPort(Protocol):
    async def generate_strategy(
        self,
        query: SearchQuery,
        previous_results: list[Source],
        iteration: int
    ) -> str:
        """Generate next search strategy."""
        ...

    async def synthesize_result(
        self,
        query: SearchQuery,
        sources: list[Source],
        evidence: list[ContentBlock]
    ) -> str:
        """Synthesize final research summary."""
        ...

    async def assess_relevance(
        self,
        query: SearchQuery,
        source: Source
    ) -> float:
        """Score source relevance (0.0-1.0)."""
        ...

Implementations:
- OllamaLLM (via Ollama, local models)
- OpenAI (paid, requires API key)
- AnthropicClaude (paid, requires API key)
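
Because ports are Protocols, any object with matching methods can stand in for an adapter, which is convenient for tests. The sketch below illustrates this with a reduced port. `StrategyPort`, `CannedStrategyLLM`, and `plan` are hypothetical names, and the method signature is simplified relative to LLMPort.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class StrategyPort(Protocol):
    """Reduced stand-in for LLMPort with a simplified signature."""

    async def generate_strategy(self, query_text: str, iteration: int) -> str:
        ...


class CannedStrategyLLM:
    """Hypothetical test double; note that it does not inherit from the port."""

    async def generate_strategy(self, query_text: str, iteration: int) -> str:
        # Widen the search on the first pass, then narrow it.
        scope = "broad" if iteration == 0 else "focused"
        return f"{scope} search for: {query_text}"


async def plan(llm: StrategyPort, query_text: str) -> list[str]:
    """Depend on the port type only, never on a concrete adapter."""
    return [await llm.generate_strategy(query_text, i) for i in range(2)]
```

Note that `runtime_checkable` only verifies that the method names exist; signatures are checked by a static type checker, not at runtime.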

ScraperPort¶

Web content retrieval.

class ScraperPort(Protocol):
    async def scrape(
        self,
        url: str,
        timeout_seconds: int = 10
    ) -> str:
        """Fetch page HTML/text."""
        ...

    async def scrape_with_javascript(
        self,
        url: str,
        timeout_seconds: int = 30
    ) -> str:
        """Fetch page after JS execution."""
        ...

    async def is_accessible(self, url: str) -> bool:
        """Check if URL is reachable."""
        ...

Implementations:
- HttpxScraper (lightweight, no JS)
- PlaywrightScraper (full browser, JS support)
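
The two fetch methods suggest a cheap-first strategy: try the plain fetch and fall back to JavaScript rendering only when the page looks empty. The sketch below shows one way a caller might do this. The `StaticOnlyScraper` stub, the `fetch_html` helper, and the `min_length` heuristic are all assumptions, not SearchMuse behavior.

```python
import asyncio


class StaticOnlyScraper:
    """Hypothetical stub standing in for a ScraperPort adapter."""

    async def scrape(self, url: str, timeout_seconds: int = 10) -> str:
        # Pretend the plain HTTP fetch returned an empty application shell.
        return "<html><body><div id='root'></div></body></html>"

    async def scrape_with_javascript(self, url: str, timeout_seconds: int = 30) -> str:
        return "<html><body><article>Rendered article text</article></body></html>"


async def fetch_html(scraper, url: str, min_length: int = 60) -> str:
    """Try the cheap fetch first; fall back to JS rendering for thin pages.

    min_length is an illustrative heuristic, not a SearchMuse setting.
    """
    html = await scraper.scrape(url)
    if len(html) >= min_length:
        return html
    return await scraper.scrape_with_javascript(url)
```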

ContentExtractorPort¶

Content parsing and extraction.

class ContentExtractorPort(Protocol):
    async def extract_article(
        self,
        html: str,
        source_url: str
    ) -> ContentBlock:
        """Extract article content."""
        ...

    async def extract_title(self, html: str) -> str:
        """Extract page title."""
        ...

    async def extract_summary(
        self,
        html: str,
        max_length: int = 200
    ) -> str:
        """Extract page summary."""
        ...

Implementations:
- TrafilaturaExtractor (trafilatura library)
- ReadabilityExtractor (readability-lxml library)
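
To make the extraction contract concrete, here is a minimal, standard-library-only sketch of title extraction. It is synchronous and far less robust than the listed adapters. `_TitleParser` and this `extract_title` function are hypothetical and do not reflect how TrafilaturaExtractor or ReadabilityExtractor work internally.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html: str) -> str:
    """Return the page title with whitespace collapsed, or '' if absent."""
    parser = _TitleParser()
    parser.feed(html)
    return " ".join("".join(parser.title_parts).split())
```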

SourceRepositoryPort¶

Persistent source storage.

class SourceRepositoryPort(Protocol):
    async def save_source(self, source: Source) -> None:
        """Store discovered source."""
        ...

    async def find_source(self, url: str) -> Source | None:
        """Retrieve source by URL."""
        ...

    async def list_sources(
        self,
        query_text: str,
        limit: int = 100
    ) -> list[Source]:
        """List sources for query."""
        ...

    async def update_source(self, source: Source) -> None:
        """Update existing source."""
        ...

Implementations:
- SQLiteRepository (aiosqlite, file-based)
- PostgresRepository (async psycopg, server-based)
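
For unit tests, a dictionary-backed adapter is often enough to satisfy this port. The sketch below is one possible shape. `InMemoryRepository` is hypothetical, the `Source` class is reduced to three fields, and matching `query_text` against titles is an assumption, since the document does not say how sources are associated with a query.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """Reduced stand-in for the Source dataclass."""
    url: str
    title: str
    relevance_score: float


class InMemoryRepository:
    """Hypothetical adapter that keeps sources in a dict keyed by URL."""

    def __init__(self) -> None:
        self._by_url: dict[str, Source] = {}

    async def save_source(self, source: Source) -> None:
        self._by_url[source.url] = source

    async def find_source(self, url: str) -> Optional[Source]:
        return self._by_url.get(url)

    async def list_sources(self, query_text: str, limit: int = 100) -> list[Source]:
        # Assumption: match the query against titles; the real lookup may differ.
        needle = query_text.lower()
        matches = [s for s in self._by_url.values() if needle in s.title.lower()]
        matches.sort(key=lambda s: s.relevance_score, reverse=True)
        return matches[:limit]

    async def update_source(self, source: Source) -> None:
        # Updating requires an existing record; saving is the way to create one.
        if source.url not in self._by_url:
            raise KeyError(source.url)
        self._by_url[source.url] = source
```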

SearchPort¶

Search engine integration.

class SearchPort(Protocol):
    async def search(
        self,
        query: str,
        max_results: int = 10,
        language: str = "en"
    ) -> list[Source]:
        """Execute search query."""
        ...

    async def search_similar(
        self,
        source_url: str,
        max_results: int = 5
    ) -> list[Source]:
        """Find similar sources."""
        ...

Implementations:
- DuckDuckGoSearch (privacy-respecting)
- GoogleSearch (requires API key)
- BingSearch (requires API key)

ResultRendererPort¶

Output formatting and presentation.

class ResultRendererPort(Protocol):
    async def render_result(
        self,
        result: ResearchResult
    ) -> str:
        """Format result for display."""
        ...

    async def render_sources(
        self,
        sources: list[Source],
        format: str = "markdown"
    ) -> str:
        """Format sources list."""
        ...

    async def render_citations(
        self,
        citations: list[Citation],
        style: str = "apa"
    ) -> str:
        """Format bibliography."""
        ...

Implementations:
- MarkdownRenderer (markdown output)
- HTMLRenderer (HTML with CSS)
- JSONRenderer (machine-readable)

Component Interaction¶

graph LR
    USER["User Query"]
    CLI["CLI Layer"]
    ORCH["ResearchOrchestrator"]
    STRAT["StrategyEngine"]
    STATE["SearchState"]

    LLM["LLMPort"]
    SEARCH["SearchPort"]
    SCRAPER["ScraperPort"]
    EXTRACT["ContentExtractorPort"]
    REPO["SourceRepositoryPort"]
    RENDER["ResultRendererPort"]

    USER --> CLI
    CLI --> ORCH
    ORCH --> STATE
    ORCH --> STRAT
    STRAT --> LLM

    ORCH --> SEARCH
    SEARCH -.-> REPO
    ORCH --> SCRAPER
    SCRAPER --> EXTRACT
    EXTRACT -.-> REPO

    ORCH --> RENDER
    RENDER --> CLI
    CLI --> USER
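
The arrows in the diagram can be read as a loop: generate a strategy, search, scrape, extract, then synthesize. The following is a rough sketch of that loop using stub ports. All class names, the `run_research` function, and the simplified signatures (strings instead of domain objects) are assumptions; persistence and rendering are left out for brevity.

```python
import asyncio


class StubLLM:
    async def generate_strategy(self, query: str, seen: list[str], iteration: int) -> str:
        return f"{query} (pass {iteration + 1})"

    async def synthesize_result(self, query: str, evidence: list[str]) -> str:
        return f"{len(evidence)} evidence blocks for '{query}'"


class StubSearch:
    async def search(self, query: str, max_results: int = 10) -> list[str]:
        return ["https://example.com/" + query.replace(" ", "-")]


class StubScraper:
    async def scrape(self, url: str) -> str:
        return f"<p>content of {url}</p>"


class StubExtractor:
    async def extract_article(self, html: str, source_url: str) -> str:
        return html.removeprefix("<p>").removesuffix("</p>")


async def run_research(query: str, llm, search, scraper, extractor,
                       max_iterations: int = 2) -> str:
    """Hypothetical orchestration loop following the arrows in the diagram."""
    seen: list[str] = []
    evidence: list[str] = []
    for iteration in range(max_iterations):
        # STRAT --> LLM: ask for the next search strategy.
        strategy = await llm.generate_strategy(query, seen, iteration)
        # ORCH --> SEARCH: run the strategy as a search query.
        for url in await search.search(strategy):
            if url in seen:
                continue  # skip sources handled in an earlier iteration
            # ORCH --> SCRAPER, then SCRAPER --> EXTRACT.
            html = await scraper.scrape(url)
            evidence.append(await extractor.extract_article(html, url))
            seen.append(url)
    return await llm.synthesize_result(query, evidence)
```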

Adapter Implementations¶

OllamaLLM¶

File: src/searchmuse/adapters/ollama_llm.py

Implements LLMPort using Ollama local models.

Configuration:
- ollama.base_url: Ollama server URL (default: http://localhost:11434)
- ollama.model: Model name (default: mistral)
- ollama.timeout_seconds: Request timeout

Features:
- Prompt engineering for task-specific strategies
- Temperature control for consistency vs. creativity
- Context window awareness
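
To show how these settings might map onto a request, the sketch below prepares (without sending) a non-streaming call to Ollama's `/api/generate` endpoint. `OllamaConfig` and `build_generate_request` are hypothetical names, the 60-second timeout and 0.2 temperature are assumed values, and the actual adapter presumably uses an async HTTP client rather than `urllib`.

```python
import json
from dataclasses import dataclass
from urllib.request import Request


@dataclass(frozen=True)
class OllamaConfig:
    base_url: str = "http://localhost:11434"  # documented default
    model: str = "mistral"                     # documented default
    timeout_seconds: int = 60                  # assumed; no default is documented


def build_generate_request(config: OllamaConfig, prompt: str,
                           temperature: float = 0.2) -> Request:
    """Prepare (but do not send) a non-streaming call to /api/generate."""
    payload = {
        "model": config.model,
        "prompt": prompt,
        "stream": False,
        # Lower temperature favors consistency; higher favors creativity.
        "options": {"temperature": temperature},
    }
    return Request(
        url=f"{config.base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```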

HttpxScraper¶

File: src/searchmuse/adapters/httpx_scraper.py

Lightweight HTTP-based scraping via httpx.

Configuration:
- scraper.user_agent: User-Agent header
- scraper.timeout_seconds: Request timeout
- scraper.max_redirects: Redirect following limit

Features:
- Async request handling
- Connection pooling
- Automatic retry on failure
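
The document does not describe the retry policy. One common pattern, sketched here as an assumption, is exponential backoff around the async fetch. `with_retries`, its parameters, and the default delays are hypothetical and independent of httpx.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(operation: Callable[[], Awaitable[T]], attempts: int = 3,
                       base_delay: float = 0.5) -> T:
    """Run operation, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A production version would likely retry only on transient errors (timeouts, connection resets) rather than on every exception.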

PlaywrightScraper¶

File: src/searchmuse/adapters/playwright_scraper.py

Full browser automation for JavaScript-heavy sites.

Configuration:
- playwright.browser: chromium, firefox, or webkit
- playwright.headless: Run without GUI
- playwright.timeout_seconds: Navigation timeout

Features:
- JavaScript execution
- Form interaction
- Screenshot capability

TrafilaturaExtractor¶

File: src/searchmuse/adapters/trafilatura_extractor.py

Content extraction via trafilatura library.

Features:
- Article body extraction
- Metadata recovery (title, date, author)
- Table and code block preservation

SQLiteRepository¶

File: src/searchmuse/adapters/sqlite_repository.py

File-based source storage via aiosqlite.

Database Schema:
- sources table: url, title, summary, relevance_score, discovered_at
- content_blocks table: source_id, text, metadata

Features:
- Async operations
- Index on URL for fast lookup
- Automatic schema creation
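
The column lists above could translate into DDL along the following lines. This is a sketch using the synchronous standard-library `sqlite3` module for brevity, whereas the adapter itself uses aiosqlite. The column types, the `id` primary keys, the JSON-text `metadata` column, and the `open_database` helper are assumptions.

```python
import sqlite3

# Column names follow the schema listed above; types and id keys are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sources (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE,          -- UNIQUE also creates an index on url
    title TEXT NOT NULL,
    summary TEXT NOT NULL,
    relevance_score REAL NOT NULL,
    discovered_at TEXT NOT NULL        -- ISO 8601 timestamp
);
CREATE TABLE IF NOT EXISTS content_blocks (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES sources(id),
    text TEXT NOT NULL,
    metadata TEXT NOT NULL DEFAULT '{}'  -- JSON-encoded dict
);
"""


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Using `CREATE TABLE IF NOT EXISTS` is one simple way to get automatic schema creation on first use.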

MarkdownRenderer¶

File: src/searchmuse/adapters/markdown_renderer.py

Markdown output with rich formatting.

Output Format:

# Research Result: [Query]

## Summary
[Synthesis]

## Sources (5 found)
1. [Title](URL) - [Relevance Score]
   Summary: [Summary]

## Full Citations
[APA-formatted bibliography]

## Evidence
[Extracted content blocks with quotes]

Features:
- Citation formatting
- Table generation
- Link preservation
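
The output format shown above can be assembled with plain string building. The sketch below is a simplified, synchronous illustration. `render_markdown` and its tuple-based `sources` parameter are hypothetical, and rendering evidence as block quotes and scores with two decimals are assumptions about the final layout.

```python
def render_markdown(query: str, synthesis: str,
                    sources: list[tuple[str, str, float, str]],
                    bibliography: str, evidence: list[str]) -> str:
    """Assemble the sections in the order used by the output format.

    Each source is a (title, url, relevance_score, summary) tuple.
    """
    lines = [f"# Research Result: {query}", "", "## Summary", synthesis, "",
             f"## Sources ({len(sources)} found)"]
    for number, (title, url, score, summary) in enumerate(sources, start=1):
        lines.append(f"{number}. [{title}]({url}) - {score:.2f}")
        lines.append(f"   Summary: {summary}")
    lines += ["", "## Full Citations", bibliography, "", "## Evidence"]
    # Assumption: each evidence block is shown as a Markdown block quote.
    lines += [f"> {block}" for block in evidence]
    return "\n".join(lines)
```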

Related documentation:
- Architecture Overview - Overall system design
- Data Flow - Component interactions during execution
- API Reference - Complete class and method definitions
- Contributing Guide - Adding new adapters

Last updated: 2026-02-28