SearchMuse Configuration Reference

Complete guide to configuring SearchMuse behavior through YAML configuration files and environment variables.

Configuration Loading

SearchMuse loads configuration in the following precedence order (highest to lowest):

1. Environment variables (SEARCHMUSE_* prefix)
2. Custom YAML file (--config parameter or config/ directory)
3. Default YAML file (config/default.yaml)

Because environment variables take precedence over both files, a containerized deployment can override individual settings without editing or rebuilding the YAML files.
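
As a rough illustration of this precedence, the sketch below merges three layers from lowest to highest priority. The function names `deep_merge` and `resolve_config` and the sample values are hypothetical; this is not SearchMuse's actual loader, only one way such layering could work.

```python
from copy import deepcopy
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict in which `override` wins over `base`, recursing into nested sections."""
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars (or mismatched types) are replaced outright.
            merged[key] = deepcopy(value)
    return merged


def resolve_config(defaults, custom, env_overrides):
    """Apply layers lowest precedence first, so later layers override earlier ones."""
    return deep_merge(deep_merge(defaults, custom), env_overrides)


# Hypothetical layers: default file, custom file, environment variables.
defaults = {"search": {"engine": "duckduckgo", "results_per_query": 10}}
custom = {"search": {"results_per_query": 20}}
env_overrides = {"search": {"engine": "google"}}

config = resolve_config(defaults, custom, env_overrides)
print(config)
```

Each layer only needs to name the keys it changes; everything else falls through from the layer below.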

Default Configuration

File: config/default.yaml

Contains sensible defaults for all settings. Never modify this file; create a custom config instead.

Configuration Structure

All configuration is organized into sections:

search:
  # Search engine settings
llm:
  # Language model settings
scraper:
  # Web scraping settings
extraction:
  # Content extraction settings
repository:
  # Data storage settings
rendering:
  # Output formatting settings
timeouts:
  # Operation timeouts
limits:
  # Resource limits
logging:
  # Logging configuration

Search Configuration

Section: `search`

search:
  # Engine to use: 'duckduckgo' | 'google' | 'bing'
  engine: duckduckgo

  # Results per search query
  results_per_query: 10

  # Maximum results across all iterations
  max_total_results: 100

  # Relevance threshold for filtering (0.0-1.0)
  relevance_threshold: 0.6

  # Language code for search results
  language: en

  # Respect robots.txt
  respect_robots_txt: true

  # Rate limit: milliseconds between requests to the same domain
  rate_limit_ms: 1000

Environment Variables:
- SEARCHMUSE_SEARCH_ENGINE = duckduckgo
- SEARCHMUSE_SEARCH_RESULTS_PER_QUERY = 10
- SEARCHMUSE_SEARCH_RELEVANCE_THRESHOLD = 0.6
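
To make the `rate_limit_ms` setting concrete, here is a minimal sketch of a per-domain limiter. The class `DomainRateLimiter` is hypothetical and is not taken from SearchMuse; it only shows how a minimum gap between requests to the same domain could be enforced.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, rate_limit_ms: int, clock=time.monotonic, sleep=time.sleep):
        self._interval = rate_limit_ms / 1000.0
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until a request to the URL's domain is allowed; return the seconds waited."""
        domain = urlparse(url).netloc.lower()
        now = self._clock()
        last = self._last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self._interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually sent (after any wait).
        self._last_request[domain] = now + delay
        return delay


limiter = DomainRateLimiter(rate_limit_ms=1000)
limiter.wait("https://example.com/a")  # first request to a domain never waits
```

Requests to different domains are tracked separately, so a slow limit on one host does not delay the others.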

LLM Configuration

Section: `llm`

llm:
  # Provider: 'ollama' | 'openai' | 'anthropic'
  provider: ollama

  # Model name/identifier
  model: mistral

  # Ollama configuration (if provider: ollama)
  ollama:
    base_url: http://localhost:11434
    timeout_seconds: 60

  # Temperature for generation (0.0-2.0)
  # Lower = more deterministic, higher = more creative
  temperature: 0.7

  # Maximum tokens for strategy generation
  max_tokens_strategy: 100

  # Maximum tokens for synthesis
  max_tokens_synthesis: 1000

  # System prompt for strategy generation
  strategy_prompt: >
    You are a research assistant helping refine search strategies.
    Based on the research query and previous results, suggest specific
    search terms to find more relevant sources. Keep suggestions concise.

  # System prompt for synthesis
  synthesis_prompt: >
    You are a research synthesizer. Based on the provided sources and
    evidence, create a comprehensive answer to the research question.
    Include citations from sources. Format in markdown.

Environment Variables:
- SEARCHMUSE_LLM_PROVIDER = ollama
- SEARCHMUSE_LLM_MODEL = mistral
- SEARCHMUSE_LLM_TEMPERATURE = 0.7
- SEARCHMUSE_OLLAMA_BASE_URL = http://localhost:11434

Ollama Model Selection:
- mistral (7B) - Fast, good quality, recommended for CPU
- neural-chat (7B) - Conversation-optimized
- llama2 (7B/13B) - General purpose
- mixtral (8x7B) - High quality, requires more RAM

Scraper Configuration

Section: `scraper`

scraper:
  # Default scraper: 'httpx' | 'playwright'
  default: httpx

  # Playwright browser for JS-heavy sites: 'chromium' | 'firefox' | 'webkit'
  browser: chromium

  # Run browser headless (no GUI)
  headless: true

  # Custom User-Agent header
  user_agent: >-
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
    (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

  # Request timeout per URL (seconds)
  timeout_seconds: 10

  # JavaScript rendering timeout (seconds)
  javascript_timeout_seconds: 30

  # Maximum redirects to follow
  max_redirects: 5

  # Enable cookies
  enable_cookies: true

  # Accepted response encodings (Accept-Encoding value)
  accept_encoding: gzip, deflate

  # Retry failed requests
  retry_attempts: 2
  retry_backoff_factor: 1.5

Environment Variables:
- SEARCHMUSE_SCRAPER_DEFAULT = httpx
- SEARCHMUSE_SCRAPER_TIMEOUT_SECONDS = 10
- SEARCHMUSE_SCRAPER_BROWSER = chromium
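
The reference does not spell out how `retry_attempts` and `retry_backoff_factor` combine. One common interpretation is a geometric backoff, sketched below. The formula, the `base_delay` parameter, and both function names are assumptions for illustration, not SearchMuse's documented behavior.

```python
import time


def backoff_delays(retry_attempts: int, retry_backoff_factor: float,
                   base_delay: float = 1.0) -> list[float]:
    """Delay before each retry: base_delay * factor**n for n = 0, 1, ... (assumed formula)."""
    return [base_delay * retry_backoff_factor ** n for n in range(retry_attempts)]


def fetch_with_retries(fetch, url, retry_attempts=2, retry_backoff_factor=1.5,
                       sleep=time.sleep):
    """Call fetch(url); on failure, wait and retry up to retry_attempts more times."""
    delays = backoff_delays(retry_attempts, retry_backoff_factor)
    for attempt in range(retry_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retry_attempts:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])


print(backoff_delays(2, 1.5))  # [1.0, 1.5]
```

Under this assumption, the defaults give one initial request plus two retries, waiting 1.0 s and then 1.5 s.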

Extraction Configuration

Section: `extraction`

extraction:
  # Extractor: 'trafilatura' | 'readability'
  engine: trafilatura

  # Include tables in extraction
  include_tables: true

  # Include code blocks in extraction
  include_code: true

  # Include image metadata
  include_images: false

  # Maximum content length (chars)
  max_length: 100000

  # Minimum content length (chars) for a page to count as an article
  min_length: 500

  # Include comments
  include_comments: false

  # Language detection
  detect_language: true

Environment Variables:
- SEARCHMUSE_EXTRACTION_ENGINE = trafilatura
- SEARCHMUSE_EXTRACTION_MAX_LENGTH = 100000

Repository Configuration

Section: `repository`

repository:
  # Storage type: 'sqlite' | 'postgres'
  type: sqlite

  # SQLite configuration
  sqlite:
    # Database file path
    path: ./data/searchmuse.db

    # Enable WAL (Write-Ahead Logging) for better concurrency
    journal_mode: wal

    # Synchronous mode: 0 (OFF), 1 (NORMAL), 2 (FULL), 3 (EXTRA)
    synchronous: 1

  # PostgreSQL configuration (if type: postgres)
  postgres:
    host: localhost
    port: 5432
    database: searchmuse
    user: searchmuse
    password: ${DB_PASSWORD}  # Load from env var

  # Automatic cleanup of old sources (days)
  cleanup_older_than_days: 90

  # Maximum stored sources per query
  max_sources_per_query: 1000

Environment Variables:
- SEARCHMUSE_REPOSITORY_TYPE = sqlite
- SEARCHMUSE_REPOSITORY_SQLITE_PATH = ./data/searchmuse.db
- SEARCHMUSE_REPOSITORY_POSTGRES_HOST = localhost
- DB_PASSWORD = (for postgres password)
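
The `journal_mode` and `synchronous` values look like SQLite PRAGMA settings. Assuming they are passed through to SQLite as PRAGMAs, the sketch below shows how they could be applied with Python's standard `sqlite3` module. The helper `open_sqlite` is hypothetical and is not part of SearchMuse.

```python
import sqlite3
import tempfile
from pathlib import Path

_JOURNAL_MODES = {"delete", "truncate", "persist", "memory", "wal", "off"}


def open_sqlite(path, journal_mode: str = "wal", synchronous: int = 1) -> sqlite3.Connection:
    """Open a SQLite database and apply the journal and synchronous settings."""
    # PRAGMA arguments cannot be bound as SQL parameters, so validate before interpolating.
    if journal_mode.lower() not in _JOURNAL_MODES:
        raise ValueError(f"unsupported journal_mode: {journal_mode!r}")
    if synchronous not in (0, 1, 2, 3):
        raise ValueError(f"unsupported synchronous value: {synchronous!r}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # e.g. create ./data on first run
    conn = sqlite3.connect(str(path))
    conn.execute(f"PRAGMA journal_mode={journal_mode}")
    conn.execute(f"PRAGMA synchronous={synchronous}")
    return conn


# Usage against a throwaway database file (WAL needs a file, not an in-memory database).
with tempfile.TemporaryDirectory() as tmp:
    conn = open_sqlite(Path(tmp) / "searchmuse.db")
    active_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    active_sync = conn.execute("PRAGMA synchronous").fetchone()[0]
    conn.close()
print(active_mode, active_sync)
```

Reading the PRAGMAs back after setting them is a cheap way to confirm that the database accepted the requested mode.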

Rendering Configuration

Section: `rendering`

rendering:
  # Output format: 'markdown' | 'html' | 'json'
  format: markdown

  # Markdown settings
  markdown:
    # Include table of contents
    include_toc: true

    # Include execution metrics (such as execution time)
    include_metrics: true

    # Citation format: 'apa' | 'chicago' | 'mla'
    citation_format: apa

    # Maximum sources to display
    max_sources: 50

  # HTML settings
  html:
    # Include CSS styling
    include_css: true

    # Dark mode support
    dark_mode: false

    # Mobile responsive
    responsive: true

  # JSON settings
  json:
    # Pretty-print JSON
    indent: 2

    # Include schema
    include_schema: false

Environment Variables:
- SEARCHMUSE_RENDERING_FORMAT = markdown
- SEARCHMUSE_RENDERING_MARKDOWN_CITATION_FORMAT = apa
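
Most variable names in this reference appear to follow the pattern SEARCHMUSE_<SECTION>_<KEY>, with nested sections joined by underscores. Because keys such as `citation_format` contain underscores themselves, a name cannot be split blindly. The sketch below resolves a name by walking an existing config tree. The functions `coerce` and `apply_env_override` are hypothetical, and the naming rule is inferred from the listed variables rather than confirmed. A name that omits the section, such as SEARCHMUSE_OLLAMA_BASE_URL, would need separate handling.

```python
from typing import Any

PREFIX = "SEARCHMUSE_"


def coerce(raw: str, current: Any) -> Any:
    """Convert the raw string to the type of the value it replaces."""
    if isinstance(current, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(current, int):
        return int(raw)
    if isinstance(current, float):
        return float(raw)
    return raw


def apply_env_override(config: dict, name: str, raw: str) -> bool:
    """Set the leaf that `name` refers to; return False if no existing key matches."""
    if not name.startswith(PREFIX):
        return False
    remainder = name[len(PREFIX):].lower()
    node = config
    while True:
        if remainder in node and not isinstance(node[remainder], dict):
            node[remainder] = coerce(raw, node[remainder])
            return True
        # Descend into the section whose name is a prefix of what remains.
        for key, value in node.items():
            if isinstance(value, dict) and remainder.startswith(key + "_"):
                node, remainder = value, remainder[len(key) + 1:]
                break
        else:
            return False


config = {"rendering": {"format": "markdown",
                        "markdown": {"citation_format": "apa", "max_sources": 50}}}
apply_env_override(config, "SEARCHMUSE_RENDERING_MARKDOWN_CITATION_FORMAT", "mla")
apply_env_override(config, "SEARCHMUSE_RENDERING_MARKDOWN_MAX_SOURCES", "10")
print(config["rendering"]["markdown"])
```

Coercing against the existing default keeps `max_sources` an integer even though environment variables are always strings.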

Timeout Configuration

Section: `timeouts`

timeouts:
  # Total research execution timeout (seconds)
  total_research: 300

  # Per-iteration timeout (seconds)
  per_iteration: 60

  # Per-search timeout (seconds)
  search: 15

  # Per-scrape timeout (seconds)
  scrape: 10

  # Content extraction timeout (seconds)
  extraction: 10

  # LLM request timeout (seconds)
  llm: 60

  # Database operation timeout (seconds)
  database: 5

Environment Variables:
- SEARCHMUSE_TIMEOUTS_TOTAL_RESEARCH = 300
- SEARCHMUSE_TIMEOUTS_SCRAPE = 10
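
The sketch below shows one way the `per_iteration` and `total_research` timeouts could nest, using `asyncio.wait_for`. The function `run_research` and its `iterate` callback are hypothetical, and skipping a timed-out iteration is an assumed policy; how SearchMuse actually reacts to a timeout is not stated here.

```python
import asyncio


async def run_research(iterate, max_iterations: int,
                       per_iteration: float, total_research: float) -> list:
    """Run iterations with an inner per-iteration timeout and an outer total timeout."""

    async def loop() -> list:
        results = []
        for index in range(max_iterations):
            try:
                results.append(await asyncio.wait_for(iterate(index), timeout=per_iteration))
            except asyncio.TimeoutError:
                # Assumed policy: record the miss and continue with the next iteration.
                results.append(None)
        return results

    # The outer timeout cancels the whole loop, including any iteration in progress.
    return await asyncio.wait_for(loop(), timeout=total_research)
```

With the defaults, five iterations of up to 60 seconds each fit exactly into the 300 second total. A custom config that lowers `total_research` without lowering `per_iteration` may cut the final iterations short.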

Limits Configuration

Section: `limits`

limits:
  # Maximum iterations per research session
  max_iterations: 5

  # Minimum iterations before allowing stop
  min_iterations: 1

  # Maximum query length (characters)
  max_query_length: 1000

  # Maximum sources per iteration
  max_sources_per_iteration: 20

  # Maximum content block size (characters)
  max_block_size: 10000

  # Maximum concurrent scraping operations
  max_concurrent_scrapes: 5

  # Maximum concurrent extractions
  max_concurrent_extractions: 3

Environment Variables:
- SEARCHMUSE_LIMITS_MAX_ITERATIONS = 5
- SEARCHMUSE_LIMITS_MAX_QUERY_LENGTH = 1000
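
A concurrency cap such as `max_concurrent_scrapes` is often implemented with a semaphore. The sketch below assumes an asyncio-based scraper; `scrape_all` and the `scrape` callback are hypothetical names, not SearchMuse functions.

```python
import asyncio


async def scrape_all(urls, scrape, max_concurrent_scrapes: int = 5) -> list:
    """Scrape every URL while keeping at most max_concurrent_scrapes in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_scrapes)

    async def bounded(url):
        async with semaphore:  # waits here once the cap is reached
            return await scrape(url)

    # gather preserves input order, so results line up with urls.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

Raising the cap, as the high-performance example does, trades memory and remote-server load for throughput.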

Logging Configuration

Section: `logging`

logging:
  # Log level: DEBUG | INFO | WARNING | ERROR | CRITICAL
  level: INFO

  # Log file path (optional)
  file: ./logs/searchmuse.log

  # Maximum log file size (MB) before rotation
  max_file_size_mb: 50

  # Number of backup log files to keep
  backup_count: 5

  # Log format
  format: >-
    %(asctime)s - %(name)s - %(levelname)s - %(message)s

  # Log to console
  console: true

  # Modules to debug (more verbose)
  debug_modules:
    # - searchmuse.adapters.ollama_llm
    # - searchmuse.adapters.httpx_scraper

Environment Variables:
- SEARCHMUSE_LOGGING_LEVEL = INFO
- SEARCHMUSE_LOGGING_FILE = ./logs/searchmuse.log
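
These settings map closely onto Python's standard `logging` module. The sketch below shows how a `logging` section, already parsed into a dict, might be applied. The function `build_logger` and the logger name "searchmuse" are assumptions, and SearchMuse's real setup may differ.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(cfg: dict) -> logging.Logger:
    """Configure a logger from a dict shaped like the `logging` section."""
    logger = logging.getLogger("searchmuse")
    logger.setLevel(cfg["level"])
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    # strip() drops the trailing newline that a folded YAML scalar may carry.
    formatter = logging.Formatter(cfg["format"].strip())

    if cfg.get("file"):
        log_path = Path(cfg["file"])
        log_path.parent.mkdir(parents=True, exist_ok=True)
        file_handler = RotatingFileHandler(
            log_path,
            maxBytes=cfg["max_file_size_mb"] * 1024 * 1024,  # MB to bytes
            backupCount=cfg["backup_count"],
        )
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

    if cfg.get("console"):
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)

    # An empty `debug_modules:` key parses as None, so fall back to an empty list.
    for module in cfg.get("debug_modules") or []:
        logging.getLogger(module).setLevel(logging.DEBUG)
    return logger
```

With the defaults, the log file rotates at 50 MB and keeps five backups, so the log directory stays bounded at roughly 300 MB.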

Example Configurations

Minimal Configuration

For basic usage with defaults:

# config/minimal.yaml
llm:
  provider: ollama
  model: mistral

search:
  engine: duckduckgo

High-Performance Configuration

Optimized for speed with more concurrent operations:

# config/performance.yaml
limits:
  max_concurrent_scrapes: 10
  max_concurrent_extractions: 5

timeouts:
  total_research: 180
  per_iteration: 40

search:
  results_per_query: 20
  max_total_results: 50

llm:
  temperature: 0.5  # More deterministic

Privacy-Focused Configuration

Keeps data in a local SQLite database and limits external requests:

# config/privacy.yaml
search:
  engine: duckduckgo
  respect_robots_txt: true
  rate_limit_ms: 2000

scraper:
  default: httpx
  user_agent: Mozilla/5.0 SearchMuse Research Bot

repository:
  type: sqlite
  sqlite:
    path: ./local_data/searchmuse.db
  cleanup_older_than_days: 30

Production Configuration

Suitable for server deployments:

# config/production.yaml
repository:
  type: postgres
  postgres:
    host: ${DB_HOST}
    port: ${DB_PORT}
    database: searchmuse_prod
    user: ${DB_USER}
    password: ${DB_PASSWORD}

logging:
  level: WARNING
  file: /var/log/searchmuse/research.log
  backup_count: 10

limits:
  max_iterations: 3
  max_sources_per_iteration: 10

search:
  rate_limit_ms: 2000
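
The `${DB_HOST}`-style placeholders above imply some form of environment variable interpolation. The sketch below shows one way to expand such placeholders in a parsed config. The function `expand_env` is hypothetical, and raising on a missing variable is an assumed policy, not documented SearchMuse behavior.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${NAME} placeholders in strings with values from env."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in env:
                # Assumed policy: fail loudly instead of leaving the placeholder in place.
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(substitute, value)
    return value  # numbers, booleans, and None pass through unchanged


sample = {"postgres": {"host": "${DB_HOST}", "port": "${DB_PORT}", "database": "searchmuse_prod"}}
print(expand_env(sample, {"DB_HOST": "db.internal", "DB_PORT": "5432"}))
```

Expanded values are always strings, so a field such as `port` still needs converting to an integer afterwards.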

Loading Custom Configuration

Via Command Line

searchmuse --config config/custom.yaml research "quantum computing"

Via Environment Variable

export SEARCHMUSE_CONFIG=config/custom.yaml
searchmuse research "quantum computing"

Programmatically

from searchmuse.infrastructure.config import ConfigLoader

config = ConfigLoader.from_file("config/custom.yaml")
# Use config for initialization

Validation

SearchMuse validates configuration on startup:

- Paths must be absolute, or relative to the project root
- Timeouts must be positive integers
- Limits must be positive integers
- Temperature must be between 0.0 and 2.0
- Relevance threshold must be between 0.0 and 1.0

Invalid configuration raises ConfigurationError with details.
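
The numeric rules above can be sketched as a small validator. The `ConfigurationError` class defined here is a local stand-in for the real exception, and `validate` is a hypothetical function. Path checks are omitted, and the actual error format is not documented in this reference.

```python
class ConfigurationError(ValueError):
    """Local stand-in for SearchMuse's ConfigurationError (assumed shape)."""


def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_positive_int(value) -> bool:
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


# (section, key, minimum, maximum) for the bounded float settings.
_RANGES = (("llm", "temperature", 0.0, 2.0),
           ("search", "relevance_threshold", 0.0, 1.0))


def validate(config: dict) -> None:
    """Collect every violation, then raise one ConfigurationError listing them all."""
    errors = []
    for section in ("timeouts", "limits"):
        for key, value in config.get(section, {}).items():
            if not _is_positive_int(value):
                errors.append(f"{section}.{key} must be a positive integer, got {value!r}")
    for section, key, low, high in _RANGES:
        value = config.get(section, {}).get(key)
        if value is not None and not (_is_number(value) and low <= value <= high):
            errors.append(f"{section}.{key} must be between {low} and {high}, got {value!r}")
    if errors:
        raise ConfigurationError("; ".join(errors))
```

Reporting all violations at once spares the user from fixing a config file one error at a time.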

Development Setup - Initial configuration
Deployment Guide - Production configuration
Components Guide - Component-specific config

Last updated: 2026-02-28