SearchMuse LLM Requirements

Overview

SearchMuse runs local language models through Ollama for search strategy generation, relevance and coverage assessment, and result synthesis, without sending data to external services. This document covers model selection, installation, configuration, and optimization.

Ollama Installation

Prerequisites

  • Docker or native installation support
  • For GPU acceleration: NVIDIA CUDA toolkit (optional but recommended)
  • 8GB RAM minimum (16GB recommended)

Installation Steps

Option 1: Docker Installation

# Pull Ollama image
docker pull ollama/ollama

# Run Ollama service
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

# Pull a model (see model selection below)
docker exec ollama ollama pull mistral

Option 2: Native Installation

# macOS
brew install ollama

# Ubuntu/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download installer from https://ollama.ai

# Start service
ollama serve

Verify Installation

# Check Ollama is running
curl http://localhost:11434/api/tags

# Response example:
# {"models":[{"name":"mistral:latest","size":4109298059}]}
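The same check can be scripted. The sketch below is one possible way to do it, not part of SearchMuse: the helper names `installed_models`, `has_model`, and `fetch_tags` are hypothetical, and only the `/api/tags` endpoint and the response shape come from the example above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this document


def installed_models(tags_json: str) -> list[str]:
    """Extract model names from the body returned by GET /api/tags."""
    payload = json.loads(tags_json)
    return [m["name"] for m in payload.get("models", [])]


def has_model(tags_json: str, model: str) -> bool:
    """True if `model` is installed, with or without a tag such as ':latest'."""
    names = installed_models(tags_json)
    return any(n == model or n.split(":")[0] == model for n in names)


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> str:
    """Query the running Ollama service; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With Ollama running, `has_model(fetch_tags(), "mistral")` reports whether the model pulled earlier is available.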

Model Selection

Model 1: Mistral (Recommended)

Model ID: mistral
Size: 7.3B parameters
Recommended For: Production use, balanced speed and quality

Strengths:

  • Fast inference (2-5 tokens/second)
  • Good reasoning and code understanding
  • Excellent English fluency
  • Low resource requirements
  • Great cost-performance ratio

Requirements:

  • RAM: 8GB minimum, 16GB recommended
  • Storage: 5GB
  • GPU: Optional (4GB VRAM if available)
  • Inference time: 2-5 seconds per task

Installation:

ollama pull mistral

Configuration:

llm:
  model: mistral
  parameters:
    temperature: 0.7
    top_p: 0.9
    top_k: 40

Best For:

  • General research queries
  • First-time users
  • Resource-constrained environments
  • Production deployments


Model 2: Llama 3 (Large - Better Reasoning)

Model ID: llama3 or llama3:70b
Size: 8B or 70B parameters
Recommended For: Complex analysis, deep reasoning required

Strengths:

  • Superior reasoning capabilities
  • Better handling of complex queries
  • Excellent instruction following
  • Strong performance on nuanced tasks
  • Large context window (8K tokens)

Requirements:

  • RAM: 16GB minimum, 32GB recommended
  • Storage: 7GB (8B) or 40GB (70B)
  • GPU: Strongly recommended (8GB+ VRAM)
  • Inference time: 3-8 seconds per task

Installation:

# 8B version (7GB)
ollama pull llama3

# 70B version (40GB, requires powerful GPU)
ollama pull llama3:70b

Configuration:

llm:
  model: llama3
  parameters:
    temperature: 0.5
    top_p: 0.95
    num_ctx: 8192

Best For:

  • Complex research topics
  • Multi-document synthesis
  • Detailed gap analysis
  • Academic research


Model 3: Phi 3 (Lightweight - Fastest)

Model ID: phi3 or phi3.5
Size: 3.8B parameters
Recommended For: Fast iteration, resource-limited systems

Strengths:

  • Very fast inference (5-10 tokens/second)
  • Minimal resource usage
  • Good quality for size
  • Excellent for constrained environments
  • Can run on older hardware

Requirements:

  • RAM: 4GB minimum, 8GB recommended
  • Storage: 2.5GB
  • GPU: Optional
  • Inference time: 1-2 seconds per task

Installation:

ollama pull phi3

Configuration:

llm:
  model: phi3
  parameters:
    temperature: 0.7
    top_p: 0.9

Best For:

  • Quick fact-finding
  • Simple queries
  • Laptops and budget systems
  • Real-time interactive use


Hardware Requirements by Model

Model        Parameters  RAM (min)  Storage  GPU VRAM  Speed
phi3         3.8B        4GB        2.5GB    None      5-10 t/s
mistral      7.3B        8GB        5GB      Optional  2-5 t/s
llama3       8B          16GB       7GB      8GB       3-8 t/s
llama3:70b   70B         32GB+      40GB     24GB+     1-3 t/s
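As a rough illustration of how the RAM column might drive model choice, the sketch below picks the largest model whose minimum RAM fits the machine. The function name `largest_model_for` is hypothetical, and "largest that fits" is an assumed policy; GPU and storage limits are ignored.

```python
from typing import Optional

# (model tag, minimum RAM in GB), taken from the table above, smallest first.
MODEL_MIN_RAM_GB = [
    ("phi3", 4),
    ("mistral", 8),
    ("llama3", 16),
    ("llama3:70b", 32),
]


def largest_model_for(ram_gb: float) -> Optional[str]:
    """Return the largest listed model whose minimum RAM fits, or None."""
    best = None
    for tag, min_ram in MODEL_MIN_RAM_GB:
        if ram_gb >= min_ram:
            best = tag  # later entries are larger, so keep overwriting
    return best
```

A machine with 12GB of RAM would get `mistral` under this policy. Speed or the GPU column may justify a smaller choice.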

GPU Acceleration

GPU acceleration significantly improves performance:

llm:
  use_gpu: true
  gpu_layers: "all"  # Offload all layers to GPU
  main_gpu: 0        # Which GPU to use (0 = first)

Speedup by GPU:

  • NVIDIA RTX 3080 (10GB): 3-5x speedup
  • NVIDIA RTX 4090 (24GB): 5-10x speedup
  • Apple M1/M2 (GPU): 2-3x speedup


Temperature and Parameters

Temperature controls output creativity and randomness.
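To make the effect concrete, the sketch below applies temperature to a softmax over raw token scores, which is the standard way the parameter is defined. It is an illustration only: the function name and the example scores are made up, and it does not show how Ollama implements sampling internally.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
scores = [2.0, 1.0, 0.5]
focused = softmax_with_temperature(scores, 0.2)  # mass concentrates on the top token
diverse = softmax_with_temperature(scores, 0.7)  # mass spreads to the alternatives
```

Lower temperatures sharpen the distribution toward the most likely token, which is why the assessment tasks below use 0.2-0.3 and strategy generation uses 0.7.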

Temperature Settings by Task

Strategy Generation (temperature: 0.7)

Creative exploration of search strategies. Medium randomness encourages diverse approaches.

strategy_generation:
  temperature: 0.7
  top_p: 0.9
  top_k: 40

Example prompt:

Generate diverse search strategies for: "machine learning interpretability"
Include unconventional approaches.

Relevance Assessment (temperature: 0.3)

Near-deterministic evaluation of source relevance. Low temperature keeps scoring consistent across runs.

relevance_assessment:
  temperature: 0.3
  top_p: 0.95
  top_k: 100

Example prompt:

Score this source's relevance to the query on 0.0-1.0 scale.
Be strict and consistent.

Coverage Assessment (temperature: 0.2)

Strict evaluation of research completeness. Very low temperature for consistency.

coverage_assessment:
  temperature: 0.2
  top_p: 0.95
  top_k: 100

Result Synthesis (temperature: 0.5)

Balance between creativity and consistency. Moderate temperature for readable prose.

result_synthesis:
  temperature: 0.5
  top_p: 0.9
  top_k: 50
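One way these per-task settings could reach the model is through the `options` field of Ollama's `/api/generate` endpoint. The sketch below assumes that route. `TASK_OPTIONS`, `build_request`, and `generate` are hypothetical names, not SearchMuse internals.

```python
import json
import urllib.request

# Sampling settings copied from the per-task configuration above.
TASK_OPTIONS = {
    "strategy_generation": {"temperature": 0.7, "top_p": 0.9, "top_k": 40},
    "relevance_assessment": {"temperature": 0.3, "top_p": 0.95, "top_k": 100},
    "coverage_assessment": {"temperature": 0.2, "top_p": 0.95, "top_k": 100},
    "result_synthesis": {"temperature": 0.5, "top_p": 0.9, "top_k": 50},
}


def build_request(task: str, prompt: str, model: str = "mistral",
                  base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming generate request with the task's sampling options."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": TASK_OPTIONS[task],
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(task: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request to a running Ollama service and return the generated text."""
    with urllib.request.urlopen(build_request(task, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the settings in one table means a task's sampling behavior can be tuned without touching the calling code.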

Prompt Templates

Strategy Generation Prompt

You are a research strategy expert. Given a query, generate an optimal search strategy.

Query: "{query}"

{history}

Generate a JSON response with:
{
  "search_terms": ["term1", "term2", ...],
  "domain_preferences": ["site:github.com", "-site:pinterest.com", ...],
  "rationale": "Why this strategy..."
}

Be specific and actionable.
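Filling this template needs some care, because it mixes `{query}`-style placeholders with literal JSON braces and `str.format` raises an error on the latter. The sketch below is one possible workaround that substitutes only named placeholders. The `render` helper is hypothetical, not a SearchMuse function.

```python
import re

STRATEGY_TEMPLATE = """You are a research strategy expert. Given a query, generate an optimal search strategy.

Query: "{query}"

{history}

Generate a JSON response with:
{
  "search_terms": ["term1", "term2", ...],
  "domain_preferences": ["site:github.com", "-site:pinterest.com", ...],
  "rationale": "Why this strategy..."
}

Be specific and actionable.
"""


def render(template: str, **values: str) -> str:
    """Replace {name} placeholders in one pass; all other braces stay literal."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Unknown placeholders are left untouched rather than raising.
        return values.get(name, match.group(0))

    return re.sub(r"\{(\w+)\}", substitute, template)


prompt = render(STRATEGY_TEMPLATE, query="machine learning interpretability", history="")
```

Because substitution happens in a single pass, braces inside a user's query are not treated as placeholders.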

Relevance Assessment Prompt

Assess source relevance to the query.

Query: "{query}"

Source Title: "{title}"
Source Content (first 500 words):
{content}

On a 0.0-1.0 scale, rate relevance. Consider:
- Direct answer to query
- Authority and expertise
- Recency (for time-sensitive topics)
- Clarity and comprehensiveness

Return JSON:
{
  "relevance_score": 0.X,
  "explanation": "..."
}

Be strict. Score <= 0.4 for tangentially related content.
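Models sometimes wrap the requested JSON in extra prose, so the reply is worth validating before the score is used. The sketch below shows one defensive way to do that. `parse_relevance` is a hypothetical helper, and clamping to the 0.0-1.0 range is an assumed policy rather than documented SearchMuse behavior.

```python
import json
import re


def parse_relevance(raw: str) -> tuple[float, str]:
    """Extract the JSON object from a model reply and validate the score."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    score = float(data["relevance_score"])
    score = min(1.0, max(0.0, score))  # clamp to the 0.0-1.0 scale from the prompt
    return score, str(data.get("explanation", ""))
```

A reply with malformed JSON raises `json.JSONDecodeError`, which the caller can treat as a signal to retry the request.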

Coverage Assessment Prompt

Assess research coverage completeness.

Original Query: "{query}"

Retrieved {num_sources} sources:
{sources_summary}

Evaluate coverage (0.0-1.0). Consider:
- All major subtopics addressed
- Diverse perspectives included
- Sufficient depth per topic
- Currency of sources

Identify gaps:
{
  "coverage_score": 0.X,
  "gaps": ["gap1", "gap2", ...],
  "confidence": 0.X
}

Be critical. 0.7 = adequate, 0.9+ = comprehensive.
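The coverage result lends itself to a simple stopping rule. The sketch below assumes follow-up searches are generated from the reported gaps whenever the score falls below the "adequate" mark. The `Coverage` class, `next_queries`, and the way gaps are appended to the query are all illustrative assumptions.

```python
from dataclasses import dataclass, field

ADEQUATE = 0.7  # "0.7 = adequate" in the prompt above


@dataclass
class Coverage:
    """Parsed form of the JSON returned by the coverage prompt."""
    coverage_score: float
    gaps: list[str] = field(default_factory=list)
    confidence: float = 0.0


def next_queries(query: str, coverage: Coverage, threshold: float = ADEQUATE) -> list[str]:
    """Return one follow-up query per gap, or [] when coverage is adequate."""
    if coverage.coverage_score >= threshold:
        return []
    return [f"{query} {gap}" for gap in coverage.gaps]
```

An empty result means the research loop can stop and move on to synthesis.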

Result Synthesis Prompt

Synthesize research into a coherent answer.

Query: "{query}"

Sources (with content):
{all_sources}

Generate a comprehensive answer that:
1. Directly addresses the query
2. Uses information from multiple sources
3. Cites sources as [1], [2], etc.
4. Flows naturally and logically
5. Includes specific examples and data

Format as markdown with inline citations.
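For the `[1]`, `[2]` citations to be meaningful, the sources passed in `{all_sources}` need stable numbers. The sketch below numbers them and checks the answer for citations that point at no source. Both helper names and the `title`/`content` keys are assumptions made for illustration.

```python
import re


def format_sources(sources: list[dict[str, str]]) -> str:
    """Number sources from 1 so the model can cite them as [1], [2], etc."""
    return "\n\n".join(
        f"[{n}] {s['title']}\n{s['content']}" for n, s in enumerate(sources, start=1)
    )


def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited numbers that do not correspond to any supplied source."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)
```

A non-empty result from `invalid_citations` suggests the model invented a reference, so the answer may need regenerating.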

Custom Model Configuration

Use a Different Model

To use a model other than the defaults:

  1. Pull the model:

    ollama pull neural-chat  # or any other model
    

  2. Update configuration:

    llm:
      model: neural-chat
      base_url: http://localhost:11434
      timeout: 60
    

  3. Test the model:

    # Verify model is available
    ollama list
    
    # Manual test
    curl http://localhost:11434/api/generate \
      -d '{
        "model": "neural-chat",
        "prompt": "Hello!",
        "stream": false
      }'
    

Available Models on Ollama Library

Mistral family:
  - mistral (7B, recommended)
  - mixtral (8x7B mixture-of-experts)
  - neural-chat (fine-tuned mistral)

Llama family:
  - llama3 (8B and 70B)
  - llama2 (7B, 13B, 70B)
  - llama2-uncensored

Other popular:
  - dolphin-mixtral
  - zephyr
  - orca-mini
  - solar

Full list: https://ollama.ai/library


Fine-Tuning (Advanced)

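For specialized use cases, fine-tune a model. Ollama does not train models itself: fine-tune with an external tool, then import the result through a Modelfile.

# Create training data in JSONL format for your fine-tuning tool
# {"prompt": "...", "response": "..."}

# After fine-tuning externally, describe the model in a Modelfile:
#   FROM mistral
#   ADAPTER ./adapter   # path to the fine-tuned LoRA adapter (illustrative)

# Register the model with Ollama
ollama create custom-model -f Modelfile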

Fine-tuning is optional and advanced. Start with base models.


Performance Optimization

Batch Processing

Process multiple requests efficiently:

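import concurrent.futures
from typing import List

# `llm` is the SearchMuse LLM client, assumed to be initialized elsewhere.

def process_batch(queries: List[str]) -> List[str]:
    """Process multiple queries in parallel; results keep the input order."""

    # A small pool avoids flooding a single local Ollama instance.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        futures = [
            executor.submit(llm.generate, query)
            for query in queries
        ]
        return [f.result() for f in futures]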

Caching Responses

Reuse LLM responses for identical inputs:

llm:
  cache:
    enabled: true
    ttl: 3600  # seconds
    backend: sqlite  # or redis
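The idea behind these settings can be sketched with an in-memory stand-in for the sqlite or redis backend. The class below is illustrative, not SearchMuse's cache: entries are keyed on a hash of model, prompt, and sampling options, and expire after `ttl` seconds.

```python
import hashlib
import json
import time
from typing import Callable, Optional


class ResponseCache:
    """Minimal TTL cache for LLM responses (in-memory, for illustration)."""

    def __init__(self, ttl: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(model: str, prompt: str, options: dict) -> str:
        """Identical inputs hash to the same key; any change yields a new one."""
        blob = json.dumps(
            {"model": model, "prompt": prompt, "options": options}, sort_keys=True
        )
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self.clock(), value)
```

Including the sampling options in the key matters: the same prompt at a different temperature is a different request and should not reuse a cached answer.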

Streaming Responses

For interactive use, stream token-by-token:

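from typing import Iterator

def generate_streaming(prompt: str) -> Iterator[str]:
    """Print each token as it arrives and yield it to the caller."""
    for token in llm.generate_stream(prompt):
        print(token, end="", flush=True)
        yield token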
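At the API level, Ollama streams `/api/generate` output as newline-delimited JSON objects, each carrying a `response` fragment and a `done` flag. The sketch below parses such a stream. `iter_tokens` is a hypothetical helper and makes no claim about how `llm.generate_stream` is implemented.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip keep-alive or blank lines
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final object marks the end of generation
```

The object returned by `urllib.request.urlopen` can be iterated line by line, so it can be passed to `iter_tokens` directly when the request body sets `"stream": true`.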

Troubleshooting

Model Not Responding

# Check if Ollama service is running
curl http://localhost:11434/api/tags

# If the request fails, restart Ollama
ollama serve            # native installation
docker restart ollama   # Docker installation

# Verify model is installed
ollama list

Out of Memory

# Use smaller model
ollama pull mistral  # instead of llama3

# Reduce batch size
llm:
  batch_size: 1

# Reduce context size
llm:
  context_size: 2048  # instead of 8192

Slow Performance

# Enable GPU acceleration
llm:
  use_gpu: true

# Use faster model
ollama pull phi3  # instead of mistral

# Reduce precision
llm:
  quantization: q4  # instead of q8

Monitoring and Logging

logging:
  llm_requests: true
  log_level: INFO
  log_file: logs/llm.log

monitoring:
  track_latency: true
  track_token_usage: true
  track_memory: true

View logs:

tail -f logs/llm.log | grep "strategy_generation"


Configuration Template

# searchmuse_config.yaml
llm:
  # Model selection
  provider: ollama
  model: mistral  # or llama3, phi3
  base_url: http://localhost:11434

  # Connection
  timeout: 60
  max_retries: 3
  retry_delay: 1

  # Performance
  use_gpu: true
  gpu_layers: all
  batch_size: 1

  # Temperature by task
  temperatures:
    strategy: 0.7
    relevance: 0.3
    coverage: 0.2
    synthesis: 0.5

  # Caching
  cache:
    enabled: true
    backend: sqlite
    ttl: 3600

  # Logging
  logging:
    enabled: true
    level: INFO

Use this configuration to customize SearchMuse for your environment.
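The `max_retries` and `retry_delay` settings in the template imply a retry loop around each LLM call. The sketch below shows one plausible reading: a fixed delay between attempts, retrying only on connection-level errors. The helper name and the choice of `OSError` as the retryable class are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on OSError retry up to max_retries times with a fixed delay."""
    attempt = 0
    while True:
        try:
            return fn()
        except OSError:  # covers urllib's URLError and socket timeouts
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(retry_delay)
```

With the template's values, a call is attempted at most four times (one initial try plus three retries), waiting one second between attempts.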