auto_sigma_rule_generator/CLAUDE.md
bpmcdevitt eca51167af FEATURE: Add Docker Compose support for CLI application with comprehensive usage documentation
2025-07-21 13:52:28 -05:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview - CLI-Based Architecture (v2.0)

This project is the CVE-SIGMA Auto Generator, which has been converted from a web application into a CLI tool with file-based SIGMA rule management. The system supports:

  1. Bulk NVD Data Processing: Downloads and processes complete NVD JSON datasets (2002-2025)
  2. nomi-sec PoC Integration: Uses curated PoC data from github.com/nomi-sec/PoC-in-GitHub
  3. Enhanced SIGMA Rule Generation: Creates intelligent rules based on real exploit indicators
  4. Comprehensive Database Seeding: Supports both bulk and incremental data updates

Architecture - CLI-Based System

Current Primary Architecture (v2.0)

  • CLI Interface: Professional command-line tool (cli/sigma_cli.py) with modular commands
  • File-Based Storage: Git-friendly YAML and JSON files organized by year/CVE-ID
  • Directory Structure:
    • cves/YEAR/CVE-ID/: Individual CVE directories with metadata and multiple rule variants
    • cli/commands/: Modular command system (process, generate, search, stats, export, migrate)
    • reports/: Generated statistics and export outputs
  • Data Processing:
    • Reuses existing backend processors for CVE fetching and analysis
    • File-based rule generation with multiple variants per CVE
    • CLI-driven bulk operations and incremental updates
  • Storage Format:
    • metadata.json: CVE information, PoC data, processing history
    • rule_*.sigma: Multiple SIGMA rule variants (template, LLM, hybrid)
    • poc_analysis.json: Extracted exploit indicators and analysis
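
To illustrate the layout described above, here is a minimal Python sketch that resolves a CVE directory and lists its rule variants. The helper names (cve_dir, list_rule_variants) are hypothetical and not taken from the repository, and the sketch assumes the year is the second field of the CVE ID.

```python
from pathlib import Path


def cve_dir(base: Path, cve_id: str) -> Path:
    """Return base/YEAR/CVE-ID, taking YEAR from the CVE ID (assumption)."""
    parts = cve_id.split("-")
    if len(parts) != 3 or parts[0] != "CVE" or not parts[1].isdigit():
        raise ValueError(f"not a CVE ID: {cve_id!r}")
    return base / parts[1] / cve_id


def list_rule_variants(directory: Path) -> list[str]:
    """Return the sorted names of rule_*.sigma files in a CVE directory."""
    return sorted(p.name for p in directory.glob("rule_*.sigma"))
```

For example, cve_dir(Path("cves"), "CVE-2024-0001") resolves to cves/2024/CVE-2024-0001, matching the paths used in the commands later in this file.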

Database Components (For Migration Only)

  • Database Models: backend/database_models.py - SQLAlchemy models for data migration
  • Legacy Support: Core data processors maintained for CLI integration
  • Migration Tools: Complete CLI-based migration utilities from legacy database

Common Development Commands

# Quick start with Docker Compose
cp .env.example .env  # Edit with your API keys (optional)
docker-compose up -d  # Start all services (db, redis, CLI container)

# Access CLI in container
docker-compose exec sigma-cli bash

# Run CLI commands in container
docker-compose exec sigma-cli python cli/sigma_cli.py --help
docker-compose exec sigma-cli python cli/sigma_cli.py process year 2024

# Use Makefile shortcuts
make setup       # Initial setup
make up          # Start services
make shell       # Access CLI shell
make cli CMD="stats overview"  # Run specific CLI commands

Native CLI Installation (Alternative)

# Install CLI dependencies
pip install -r backend/requirements.txt
pip install click rich tabulate pyyaml

# Make CLI executable
chmod +x cli/sigma_cli.py

# Initialize configuration
./cli/sigma_cli.py config-init

# Test CLI installation
./cli/sigma_cli.py --help

CLI Primary Operations

# Process CVEs and generate SIGMA rules
./cli/sigma_cli.py process year 2024                    # Process specific year
./cli/sigma_cli.py process cve CVE-2024-0001            # Process specific CVE
./cli/sigma_cli.py process bulk --start-year 2020       # Bulk process years
./cli/sigma_cli.py process incremental --days 7         # Process recent changes

# Generate rules for existing CVEs
./cli/sigma_cli.py generate cve CVE-2024-0001 --method all
./cli/sigma_cli.py generate regenerate --year 2024 --method llm

# Search and analyze
./cli/sigma_cli.py search cve "buffer overflow" --severity critical --has-poc
./cli/sigma_cli.py search rules "powershell" --method llm

# Statistics and reports
./cli/sigma_cli.py stats overview --year 2024
./cli/sigma_cli.py stats poc --year 2024
./cli/sigma_cli.py stats rules --method template

# Export data
./cli/sigma_cli.py export sigma ./output-rules --format yaml --year 2024
./cli/sigma_cli.py export metadata ./reports/cve-data.csv --format csv

Migration from Web Application

# Migrate existing database to file structure
./cli/sigma_cli.py migrate from-database --database-url "postgresql://user:pass@localhost:5432/db"

# Validate migrated data
./cli/sigma_cli.py migrate validate --year 2024

# Check migration statistics
./cli/sigma_cli.py stats overview

Database Migration Support

# If you have an existing PostgreSQL database with CVE data
export DATABASE_URL="postgresql://user:pass@localhost:5432/cve_sigma_db"

# Migrate database to CLI file structure
./cli/sigma_cli.py migrate from-database --database-url "$DATABASE_URL"

Development and Testing

# CLI with verbose logging
./cli/sigma_cli.py --verbose process year 2024

# Test individual commands
./cli/sigma_cli.py version
./cli/sigma_cli.py config-init
./cli/sigma_cli.py stats overview

# Check file structure
ls -la cves/2024/                      # View processed CVEs
ls -la cves/2024/CVE-2024-0001/        # View individual CVE files

Key Configuration

Environment Variables (.env)

  • NVD_API_KEY: Optional NVD API key for higher rate limits (5→50 requests/30s)
  • GITHUB_TOKEN: Optional GitHub token for exploit analysis (enhances rule generation)
  • OPENAI_API_KEY: Optional OpenAI API key for AI-enhanced SIGMA rule generation
  • ANTHROPIC_API_KEY: Optional Anthropic API key for AI-enhanced SIGMA rule generation
  • OLLAMA_BASE_URL: Optional Ollama base URL for AI-enhanced SIGMA rule generation with local models (default: http://localhost:11434)
  • LLM_PROVIDER: Optional LLM provider selection (openai, anthropic, ollama)
  • LLM_MODEL: Optional LLM model selection (provider-specific)
  • DATABASE_URL: PostgreSQL connection string (needed only when migrating from the legacy database)
  • REACT_APP_API_URL: Backend API URL for frontend
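
As a rough sketch of how these optional variables might be read, the following Python snippet loads them into a small settings object. The Settings class and load_settings function are hypothetical names introduced for illustration; the project's real configuration code may be organized differently. Empty values are treated as unset, which is an assumption of this sketch.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    nvd_api_key: Optional[str]
    github_token: Optional[str]
    llm_provider: Optional[str]
    llm_model: Optional[str]
    ollama_base_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the optional variables, treating empty strings as unset."""

    def get(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None

    return Settings(
        nvd_api_key=get("NVD_API_KEY"),
        github_token=get("GITHUB_TOKEN"),
        llm_provider=get("LLM_PROVIDER"),
        llm_model=get("LLM_MODEL"),
        # Default URL as documented for Ollama in this file.
        ollama_base_url=get("OLLAMA_BASE_URL") or "http://localhost:11434",
    )
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dict instead of the real process environment.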

CLI Configuration

  • Configuration File: ~/.sigma-cli/config.yaml (auto-created with config-init)
  • Directory Structure:
    • cves/YEAR/CVE-ID/: Individual CVE data and rules
    • reports/: Generated statistics and exports
    • cli/: Command-line tool and modules

Database Connection (For Migration Only)

  • PostgreSQL: localhost:5432 (if migrating from legacy database)
  • Connection String: Set via DATABASE_URL environment variable

Enhanced API Endpoints

Bulk Processing

  • POST /api/bulk-seed - Start complete bulk seeding (NVD + nomi-sec)
  • POST /api/incremental-update - Update with NVD modified/recent feeds
  • POST /api/sync-nomi-sec - Synchronize nomi-sec PoC data
  • POST /api/regenerate-rules - Regenerate SIGMA rules with enhanced data
  • GET /api/bulk-jobs - Get bulk processing job status
  • GET /api/bulk-status - Get comprehensive system status
  • GET /api/poc-stats - Get PoC-related statistics

Enhanced Data Access

  • GET /api/stats - Enhanced statistics with PoC coverage
  • GET /api/claude-status - Get Claude API availability status
  • All existing CVE and SIGMA rule endpoints now include enhanced data fields

LLM-Enhanced Rule Generation

  • POST /api/llm-enhanced-rules - Generate SIGMA rules using LLM analysis (supports multiple providers)
  • GET /api/llm-status - Check LLM API availability and configuration for all providers
  • POST /api/llm-switch - Switch between LLM providers and models

Code Architecture Details

CLI Structure (Primary)

  • cli/sigma_cli.py: Main executable CLI with Click framework
  • cli/commands/: Modular command system
    • base_command.py: Common functionality and file operations
    • process_commands.py: CVE processing and bulk operations
    • generate_commands.py: SIGMA rule generation
    • search_commands.py: Search and filtering
    • stats_commands.py: Statistics and reporting
    • export_commands.py: Data export in multiple formats
    • migrate_commands.py: Database migration tools
  • cli/config/: Configuration management
  • cli/README.md: Detailed CLI documentation
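
The group/subcommand shape of the modular command system can be sketched as follows. Note that this is not the project's code: the real CLI uses Click, whereas this sketch deliberately uses the standard-library argparse so that it runs without third-party packages. Only a few of the commands shown elsewhere in this file (process year, process cve, stats overview) are modeled, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a two-level parser: command group first, then subcommand."""
    parser = argparse.ArgumentParser(prog="sigma_cli")
    parser.add_argument("--verbose", action="store_true")
    groups = parser.add_subparsers(dest="group", required=True)

    # "process" group: one subcommand per processing mode.
    process = groups.add_parser("process").add_subparsers(
        dest="command", required=True
    )
    process.add_parser("year").add_argument("year", type=int)
    process.add_parser("cve").add_argument("cve_id")

    # "stats" group: reporting subcommands.
    stats = groups.add_parser("stats").add_subparsers(
        dest="command", required=True
    )
    stats.add_parser("overview").add_argument("--year", type=int)
    return parser
```

Each group gets its own subparser set, which mirrors the one-module-per-group split in cli/commands/.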

File-Based Storage Structure

  • CVE Directories: cves/YEAR/CVE-ID/ with individual metadata and rule files
  • Rule Variants: Multiple SIGMA files per CVE (template, LLM, hybrid)
  • Metadata Format: JSON files with processing history and PoC data
  • Reports: Generated statistics and export outputs

Backend Data Processors (Reused by CLI)

  • database_models.py: SQLAlchemy models for data migration
  • Data Processors: Core processing logic reused by CLI
    • nvd_bulk_processor.py: NVD JSON dataset processing
    • nomi_sec_client.py: nomi-sec PoC integration
    • enhanced_sigma_generator.py: SIGMA rule generation
    • llm_client.py: Multi-provider LLM integration
    • poc_analyzer.py: PoC content analysis

CLI-Based Data Processing Flow

  1. CVE Processing: NVD data fetch → File storage → PoC analysis → Metadata generation
  2. Rule Generation: Template/LLM/Hybrid generation → Multiple rule variants → File storage
  3. Search & Analysis: File-based searching → Statistics generation → Export capabilities
  4. Migration Support: Database export → File conversion → Validation → Cleanup

Legacy Web Processing Flow (For Reference)

  1. Bulk Seeding: NVD JSON downloads → Database storage → nomi-sec PoC sync → Enhanced rule generation
  2. Incremental Updates: NVD modified feeds → Update existing data → Sync new PoCs
  3. Rule Enhancement: PoC analysis → Indicator extraction → Template selection → Enhanced SIGMA rule
  4. LLM-Enhanced Generation: PoC content analysis → Multi-provider LLM processing → Advanced SIGMA rule creation

Development Notes

Enhanced Rule Generation Logic

Rule generation proceeds in seven steps:

  1. CVE Analysis: Extract metadata from NVD bulk data
  2. PoC Quality Assessment: nomi-sec PoC analysis with star count, recency, quality tiers
  3. Advanced Indicator Extraction: Processes, files, network, registry, commands from PoC repositories
  4. Template Selection: Smart template matching based on PoC indicators and CVE characteristics
  5. Enhanced Rule Population: Incorporate real exploit indicators with quality scoring
  6. MITRE ATT&CK Mapping: Automatic technique identification based on indicators
  7. LLM Enhancement: Optional multi-provider LLM integration that generates rules from analysis of PoC code
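
The indicator extraction step can be pictured with a small regex-based sketch. The patterns and the extract_indicators name are hypothetical and far narrower than what real PoC analysis would need; they cover only three of the indicator categories named above (processes, registry, network) and are not taken from poc_analyzer.py.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "processes": re.compile(r"\b[\w.-]+\.exe\b", re.IGNORECASE),
    "registry": re.compile(r"\bHK(?:LM|CU|CR|U)\\[\w\\.-]+", re.IGNORECASE),
    "network": re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE),
}


def extract_indicators(poc_text: str) -> dict[str, list[str]]:
    """Return sorted, de-duplicated matches per indicator category."""
    return {
        name: sorted({m.group(0) for m in pattern.finditer(poc_text)})
        for name, pattern in PATTERNS.items()
    }
```

The resulting lists could then feed template selection and rule population, though how the project weights and filters indicators is not specified here.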

Quality Tiers

  • Excellent (80+ points): High star count, recent updates, detailed descriptions
  • Good (60-79 points): Moderate quality indicators
  • Fair (40-59 points): Basic PoC with some quality indicators
  • Poor (20-39 points): Minimal quality indicators
  • Very Poor (<20 points): Low-quality PoCs
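
The tier boundaries above translate directly into a lookup. This is a minimal sketch; the function name quality_tier is hypothetical, and how the score itself is computed from star count, recency, and descriptions is not shown here.

```python
# (lower bound, tier name), highest first; bounds follow the tier list above.
TIERS = [(80, "Excellent"), (60, "Good"), (40, "Fair"), (20, "Poor")]


def quality_tier(score: float) -> str:
    """Map a PoC quality score to its tier name."""
    for lower_bound, name in TIERS:
        if score >= lower_bound:
            return name
    return "Very Poor"
```

For instance, a score of 79 maps to "Good" and a score of 80 maps to "Excellent".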

Multi-Provider LLM Integration Features

  • Multiple LLM Providers: Support for OpenAI, Anthropic, and Ollama (local models)
  • Dynamic Provider Switching: Switch between providers and models through the UI or API
  • Intelligent Code Analysis: LLMs analyze actual exploit code from PoC repositories
  • Advanced Rule Generation: Creates sophisticated SIGMA rules with proper syntax and logic
  • Contextual Understanding: Interprets CVE descriptions and maps them to appropriate detection patterns
  • Automatic Validation: Generated rules are validated for SIGMA syntax compliance
  • Fallback Mechanism: Automatically falls back to template-based generation if the LLM is unavailable
  • Enhanced Metadata: Rules include generation method tracking for quality assessment
  • LangChain Integration: Uses LangChain for robust LLM integration and prompt management
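
The fallback mechanism and generation method tracking listed above might look roughly like the sketch below. The function generate_with_fallback and the two generator callables are hypothetical stand-ins; the real logic lives in the project's generator and LLM client modules and may differ.

```python
from typing import Callable, Optional, Tuple

# A generator takes a CVE ID and returns rule text (assumed interface).
Generator = Callable[[str], str]


def generate_with_fallback(
    cve_id: str,
    llm_generate: Optional[Generator],
    template_generate: Generator,
) -> Tuple[str, str]:
    """Return (rule_text, method), preferring the LLM when it works."""
    if llm_generate is not None:
        try:
            return llm_generate(cve_id), "llm"
        except Exception:
            # Broad catch for brevity; real code should catch the
            # provider's specific errors and log the failure.
            pass
    return template_generate(cve_id), "template"
```

Returning the method alongside the rule lets the caller record which path produced each rule.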

Supported LLM Providers and Models

OpenAI

  • API Key: Set OPENAI_API_KEY environment variable
  • Supported Models: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
  • Default Model: gpt-4o-mini
  • Rate Limits: Based on OpenAI API limits

Anthropic

  • API Key: Set ANTHROPIC_API_KEY environment variable
  • Supported Models: claude-3-5-sonnet-20241022, claude-3-haiku-20240307, claude-3-opus-20240229
  • Default Model: claude-3-5-sonnet-20241022
  • Rate Limits: Based on Anthropic API limits

Ollama (Local Models)

  • Setup: Install Ollama locally and set OLLAMA_BASE_URL (default: http://localhost:11434)
  • Supported Models: llama3.2, codellama, mistral, llama2 (any Ollama-compatible model)
  • Default Model: llama3.2
  • Rate Limits: No external API limits (local processing)
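
A small sketch of resolving the provider and model from LLM_PROVIDER and LLM_MODEL, using the default models listed above. The resolve_model function is a hypothetical helper, not part of llm_client.py. It does not restrict the model name, since Ollama accepts any compatible model.

```python
from typing import Optional, Tuple

# Default model per provider, as listed in this section.
DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "anthropic": "claude-3-5-sonnet-20241022",
    "ollama": "llama3.2",
}


def resolve_model(provider: str, model: Optional[str] = None) -> Tuple[str, str]:
    """Return (provider, model), filling in the provider's default model."""
    key = provider.strip().lower()
    if key not in DEFAULT_MODELS:
        raise ValueError(f"unsupported LLM provider: {provider!r}")
    return key, (model or DEFAULT_MODELS[key])
```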

Testing and Validation

  • Frontend tests: npm test (in frontend directory)
  • Backend testing: Use standalone scripts for bulk operations
  • API testing: Use /docs endpoint for Swagger UI
  • Task Monitoring: Monitor via Flower dashboard at http://localhost:5555
  • Celery Tasks: Use celery -A celery_config worker --loglevel=info for debugging

Security Considerations

  • API Keys: Store NVD and GitHub tokens in environment variables
  • PoC Analysis: Automated analysis of curated PoC repositories (safer than raw GitHub search)
  • Rate Limiting: Built-in rate limiting for external APIs
  • Data Validation: Enhanced validation for bulk data processing
  • Audit Trail: Job tracking for all bulk operations

Troubleshooting

Common Issues

  • Bulk Processing Failures: Check /api/bulk-jobs for detailed error messages
  • NVD Data Download Issues: Verify NVD API key and network connectivity
  • nomi-sec API Timeouts: Retries are built in; if timeouts persist, check network connectivity
  • Frontend build errors: Run npm install in frontend directory
  • Database schema changes: Restart backend to auto-create new tables
  • Memory issues during bulk processing: Monitor system resources, consider smaller batch sizes

Enhanced Rate Limits

  • NVD API: 5 requests/30s (no key) → 50 requests/30s (with key)
  • nomi-sec API: 1 request/second (built-in rate limiting)
  • GitHub API (fallback): 60 requests/hour (no token) → 5000 requests/hour (with token)
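
One way to respect limits of the form "N requests per window" is a sliding-window limiter. The sketch below is illustrative only: SlidingWindowLimiter and nvd_limiter are hypothetical names, and the project's built-in rate limiting may use a different technique. The NVD numbers come from the list above.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(self, max_calls: int, period: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self._calls: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def record(self) -> None:
        """Register that a request was just sent."""
        self._calls.append(self.clock())


def nvd_limiter(has_api_key: bool) -> SlidingWindowLimiter:
    """5 requests/30s without a key, 50 requests/30s with one."""
    return SlidingWindowLimiter(50 if has_api_key else 5, 30.0)
```

A caller would sleep for wait_time() seconds, send the request, then call record(). The injectable clock exists so the limiter can be exercised without real delays.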

Performance Optimization

  • Bulk Processing: Start with recent years (2020+) for faster initial setup
  • PoC Sync: Use smaller batch sizes (50) for better stability
  • Rule Generation: Monitor quality scores to prioritize high-value PoCs
  • Database: Ensure proper indexing on CVE ID and PoC fields
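
The batch-size advice above amounts to splitting the work list into fixed-size chunks. A minimal sketch follows, with a hypothetical batched helper whose default size of 50 is taken from the PoC Sync recommendation; how the project actually batches its sync jobs is not shown here.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int = 50) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Processing one chunk at a time keeps memory use bounded and limits how much work is lost if a single batch fails.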

Monitoring

  • Frontend: Use Bulk Jobs tab for real-time progress monitoring
  • Backend logs: docker-compose logs -f backend
  • Job status: Check /api/bulk-status for comprehensive system health
  • Database: Monitor PoC coverage percentage and rule enhancement progress