# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview - CLI-Based Architecture (v2.0)

This is an enhanced CVE-SIGMA Auto Generator that has been **transformed from a web application to a professional CLI tool** with file-based SIGMA rule management. The system now supports:

1. **Bulk NVD Data Processing**: Downloads and processes complete NVD JSON datasets (2002-2025)
2. **nomi-sec PoC Integration**: Uses curated PoC data from github.com/nomi-sec/PoC-in-GitHub
3. **Enhanced SIGMA Rule Generation**: Creates intelligent rules based on real exploit indicators
4. **Comprehensive Database Seeding**: Supports both bulk and incremental data updates

## Architecture - CLI-Based System

### **Current Primary Architecture (v2.0)**
- **CLI Interface**: Professional command-line tool (`cli/sigma_cli.py`) with modular commands
- **File-Based Storage**: Git-friendly YAML and JSON files organized by year/CVE-ID
- **Directory Structure**:
  - `cves/YEAR/CVE-ID/`: Individual CVE directories with metadata and multiple rule variants
  - `cli/commands/`: Modular command system (process, generate, search, stats, export, migrate)
  - `reports/`: Generated statistics and export outputs
- **Data Processing**:
  - Reuses existing backend processors for CVE fetching and analysis
  - File-based rule generation with multiple variants per CVE
  - CLI-driven bulk operations and incremental updates
- **Storage Format**:
  - `metadata.json`: CVE information, PoC data, processing history
  - `rule_*.sigma`: Multiple SIGMA rule variants (template, LLM, hybrid)
  - `poc_analysis.json`: Extracted exploit indicators and analysis

### **Database Components (For Migration Only)**
- **Database Models**: `backend/database_models.py` - SQLAlchemy models for data migration
- **Legacy Support**: Core data processors maintained for CLI integration
- **Migration Tools**: Complete CLI-based migration utilities from legacy database

## Common Development Commands

### **Docker Compose Setup (Recommended)**
```bash
# Quick start with Docker Compose
cp .env.example .env  # Edit with your API keys (optional)
docker-compose up -d  # Start all services (db, redis, CLI container)

# Access CLI in container
docker-compose exec sigma-cli bash

# Run CLI commands in container
docker-compose exec sigma-cli python cli/sigma_cli.py --help
docker-compose exec sigma-cli python cli/sigma_cli.py process year 2024

# Use Makefile shortcuts
make setup                      # Initial setup
make up                         # Start services
make shell                      # Access CLI shell
make cli CMD="stats overview"   # Run specific CLI commands
```

### **Native CLI Installation (Alternative)**
```bash
# Install CLI dependencies
pip install -r backend/requirements.txt
pip install click rich tabulate pyyaml

# Make CLI executable
chmod +x cli/sigma_cli.py

# Initialize configuration
./cli/sigma_cli.py config-init

# Test CLI installation
./cli/sigma_cli.py --help
```

### **CLI Primary Operations**
```bash
# Process CVEs and generate SIGMA rules
./cli/sigma_cli.py process year 2024               # Process specific year
./cli/sigma_cli.py process cve CVE-2024-0001       # Process specific CVE
./cli/sigma_cli.py process bulk --start-year 2020  # Bulk process years
./cli/sigma_cli.py process incremental --days 7    # Process recent changes

# Generate rules for existing CVEs
./cli/sigma_cli.py generate cve CVE-2024-0001 --method all
./cli/sigma_cli.py generate regenerate --year 2024 --method llm

# Search and analyze
./cli/sigma_cli.py search cve "buffer overflow" --severity critical --has-poc
./cli/sigma_cli.py search rules "powershell" --method llm

# Statistics and reports
./cli/sigma_cli.py stats overview --year 2024
./cli/sigma_cli.py stats poc --year 2024
./cli/sigma_cli.py stats rules --method template

# Export data
./cli/sigma_cli.py export sigma ./output-rules --format yaml --year 2024
./cli/sigma_cli.py export metadata ./reports/cve-data.csv --format csv
```

### **Migration from Web Application**
```bash
# Migrate existing database to file structure
./cli/sigma_cli.py migrate from-database --database-url "postgresql://user:pass@localhost:5432/db"

# Validate migrated data
./cli/sigma_cli.py migrate validate --year 2024

# Check migration statistics
./cli/sigma_cli.py stats overview
```

### **Database Migration Support**
```bash
# If you have an existing PostgreSQL database with CVE data
export DATABASE_URL="postgresql://user:pass@localhost:5432/cve_sigma_db"

# Migrate database to CLI file structure
./cli/sigma_cli.py migrate from-database --database-url $DATABASE_URL
```

### **Development and Testing**
```bash
# CLI with verbose logging
./cli/sigma_cli.py --verbose process year 2024

# Test individual commands
./cli/sigma_cli.py version
./cli/sigma_cli.py config-init
./cli/sigma_cli.py stats overview

# Check file structure
ls -la cves/2024/                 # View processed CVEs
ls -la cves/2024/CVE-2024-0001/   # View individual CVE files
```

## Key Configuration

### Environment Variables (.env)
- `NVD_API_KEY`: Optional NVD API key for higher rate limits (5→50 requests/30s)
- `GITHUB_TOKEN`: Optional GitHub token for exploit analysis (enhances rule generation)
- `OPENAI_API_KEY`: Optional OpenAI API key for AI-enhanced SIGMA rule generation
- `ANTHROPIC_API_KEY`: Optional Anthropic API key for AI-enhanced SIGMA rule generation
- `OLLAMA_BASE_URL`: Optional Ollama base URL for local model AI-enhanced SIGMA rule generation
- `LLM_PROVIDER`: Optional LLM provider selection (openai, anthropic, ollama)
- `LLM_MODEL`: Optional LLM model selection (provider-specific)
- `DATABASE_URL`: PostgreSQL connection string
- `REACT_APP_API_URL`: Backend API URL for frontend

### CLI Configuration
- **Configuration File**: `~/.sigma-cli/config.yaml` (auto-created with `config-init`)
- **Directory Structure**:
  - `cves/YEAR/CVE-ID/`: Individual CVE data and rules
  - `reports/`: Generated statistics and exports
  - `cli/`: Command-line tool and modules

### Database Connection (For Migration Only)
- **PostgreSQL**: localhost:5432 (if migrating from legacy database)
- **Connection String**: Set via DATABASE_URL environment variable

### Enhanced API Endpoints

#### Bulk Processing
- `POST /api/bulk-seed` - Start complete bulk seeding (NVD + nomi-sec)
- `POST /api/incremental-update` - Update with NVD modified/recent feeds
- `POST /api/sync-nomi-sec` - Synchronize nomi-sec PoC data
- `POST /api/regenerate-rules` - Regenerate SIGMA rules with enhanced data
- `GET /api/bulk-jobs` - Get bulk processing job status
- `GET /api/bulk-status` - Get comprehensive system status
- `GET /api/poc-stats` - Get PoC-related statistics

#### Enhanced Data Access
- `GET /api/stats` - Enhanced statistics with PoC coverage
- `GET /api/claude-status` - Get Claude API availability status
- All existing CVE and SIGMA rule endpoints now include enhanced data fields

#### LLM-Enhanced Rule Generation
- `POST /api/llm-enhanced-rules` - Generate SIGMA rules using LLM AI analysis (supports multiple providers)
- `GET /api/llm-status` - Check LLM API availability and configuration for all providers
- `POST /api/llm-switch` - Switch between LLM providers and models

## Code Architecture Details

### **CLI Structure (Primary)**
- **cli/sigma_cli.py**: Main executable CLI with Click framework
- **cli/commands/**: Modular command system
  - `base_command.py`: Common functionality and file operations
  - `process_commands.py`: CVE processing and bulk operations
  - `generate_commands.py`: SIGMA rule generation
  - `search_commands.py`: Search and filtering
  - `stats_commands.py`: Statistics and reporting
  - `export_commands.py`: Data export in multiple formats
  - `migrate_commands.py`: Database migration tools
- **cli/config/**: Configuration management
- **cli/README.md**: Detailed CLI documentation

### **File-Based Storage Structure**
- **CVE Directories**: `cves/YEAR/CVE-ID/` with individual metadata and rule files
- **Rule Variants**: Multiple SIGMA files per CVE (template, LLM, hybrid)
- **Metadata Format**: JSON files with processing history and PoC data
- **Reports**: Generated statistics and export outputs

### **Backend Data Processors (Reused by CLI)**
- **database_models.py**: SQLAlchemy models for data migration
- **Data Processors**: Core processing logic reused by CLI
  - `nvd_bulk_processor.py`: NVD JSON dataset processing
  - `nomi_sec_client.py`: nomi-sec PoC integration
  - `enhanced_sigma_generator.py`: SIGMA rule generation
  - `llm_client.py`: Multi-provider LLM integration
  - `poc_analyzer.py`: PoC content analysis

### **CLI-Based Data Processing Flow**
1. **CVE Processing**: NVD data fetch → File storage → PoC analysis → Metadata generation
2. **Rule Generation**: Template/LLM/Hybrid generation → Multiple rule variants → File storage
3. **Search & Analysis**: File-based searching → Statistics generation → Export capabilities
4. **Migration Support**: Database export → File conversion → Validation → Cleanup

### **Legacy Web Processing Flow (For Reference)**
1. **Bulk Seeding**: NVD JSON downloads → Database storage → nomi-sec PoC sync → Enhanced rule generation
2. **Incremental Updates**: NVD modified feeds → Update existing data → Sync new PoCs
3. **Rule Enhancement**: PoC analysis → Indicator extraction → Template selection → Enhanced SIGMA rule
4. **LLM-Enhanced Generation**: PoC content analysis → Multi-provider LLM processing → Advanced SIGMA rule creation

## Development Notes

### Enhanced Rule Generation Logic
The application now uses an advanced rule generation process:

1. **CVE Analysis**: Extract metadata from NVD bulk data
2. **PoC Quality Assessment**: nomi-sec PoC analysis with star count, recency, quality tiers
3. **Advanced Indicator Extraction**: Processes, files, network, registry, commands from PoC repositories
4. **Template Selection**: Smart template matching based on PoC indicators and CVE characteristics
5. **Enhanced Rule Population**: Incorporate real exploit indicators with quality scoring
6. **MITRE ATT&CK Mapping**: Automatic technique identification based on indicators
7. **LLM AI Enhancement**: Optional multi-provider LLM integration for intelligent rule generation from PoC code analysis

### Quality Tiers
- **Excellent** (80+ points): High star count, recent updates, detailed descriptions
- **Good** (60-79 points): Moderate quality indicators
- **Fair** (40-59 points): Basic PoC with some quality indicators
- **Poor** (20-39 points): Minimal quality indicators
- **Very Poor** (<20 points): Low-quality PoCs

### Multi-Provider LLM Integration Features
- **Multiple LLM Providers**: Support for OpenAI, Anthropic, and Ollama (local models)
- **Dynamic Provider Switching**: Switch between providers and models through UI or API
- **Intelligent Code Analysis**: LLMs analyze actual exploit code from PoC repositories
- **Advanced Rule Generation**: Creates sophisticated SIGMA rules with proper syntax and logic
- **Contextual Understanding**: Interprets CVE descriptions and maps them to appropriate detection patterns
- **Automatic Validation**: Generated rules are validated for SIGMA syntax compliance
- **Fallback Mechanism**: Automatically falls back to template-based generation if LLM is unavailable
- **Enhanced Metadata**: Rules include generation method tracking for quality assessment
- **LangChain Integration**: Uses LangChain for robust LLM integration and prompt management

### Supported LLM Providers and Models

#### OpenAI
- **API Key**: Set `OPENAI_API_KEY` environment variable
- **Supported Models**: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
- **Default Model**: gpt-4o-mini
- **Rate Limits**: Based on OpenAI API limits

#### Anthropic
- **API Key**: Set `ANTHROPIC_API_KEY` environment variable
- **Supported Models**: claude-3-5-sonnet-20241022, claude-3-haiku-20240307, claude-3-opus-20240229
- **Default Model**: claude-3-5-sonnet-20241022
- **Rate Limits**: Based on Anthropic API limits

#### Ollama (Local Models)
- **Setup**: Install Ollama locally and set `OLLAMA_BASE_URL` (default: http://localhost:11434)
- **Supported Models**: llama3.2, codellama, mistral, llama2 (any Ollama-compatible model)
- **Default Model**: llama3.2
- **Rate Limits**: No external API limits (local processing)

### Testing and Validation
- **Frontend tests**: `npm test` (in frontend directory)
- **Backend testing**: Use standalone scripts for bulk operations
- **API testing**: Use `/docs` endpoint for Swagger UI
- **Task Monitoring**: Monitor via Flower dashboard at http://localhost:5555
- **Celery Tasks**: Use `celery -A celery_config worker --loglevel=info` for debugging

### Security Considerations
- **API Keys**: Store NVD and GitHub tokens in environment variables
- **PoC Analysis**: Automated analysis of curated PoC repositories (safer than raw GitHub search)
- **Rate Limiting**: Built-in rate limiting for external APIs
- **Data Validation**: Enhanced validation for bulk data processing
- **Audit Trail**: Job tracking for all bulk operations

## Troubleshooting

### Common Issues
- **Bulk Processing Failures**: Check `/api/bulk-jobs` for detailed error messages
- **NVD Data Download Issues**: Verify NVD API key and network connectivity
- **nomi-sec API Timeouts**: Built-in retry logic, check network connectivity
- **Frontend build errors**: Run `npm install` in frontend directory
- **Database schema changes**: Restart backend to auto-create new tables
- **Memory issues during bulk processing**: Monitor system resources, consider smaller batch sizes

### Enhanced Rate Limits
- **NVD API**: 5 requests/30s (no key) → 50 requests/30s (with key)
- **nomi-sec API**: 1 request/second (built-in rate limiting)
- **GitHub API** (fallback): 60 requests/hour (no token) → 5000 requests/hour (with token)

### Performance Optimization
- **Bulk Processing**: Start with recent years (2020+) for faster initial setup
- **PoC Sync**: Use smaller batch sizes (50) for better stability
- **Rule Generation**: Monitor quality scores to prioritize high-value PoCs
- **Database**: Ensure proper indexing on CVE ID and PoC fields

### Monitoring
- **Frontend**: Use Bulk Jobs tab for real-time progress monitoring
- **Backend logs**: `docker-compose logs -f backend`
- **Job status**: Check `/api/bulk-status` for comprehensive system health
- **Database**: Monitor PoC coverage percentage and rule enhancement progress
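
For reference, the score ranges listed under "Quality Tiers" can be read as a simple threshold lookup. The sketch below is illustrative only: the function name `quality_tier` is hypothetical and is not taken from `poc_analyzer.py` or any other module in this repository, and it assumes scores are plain numbers on the same point scale as the tier table.

```python
def quality_tier(score: float) -> str:
    """Map a PoC quality score to a tier name.

    Hypothetical helper: thresholds mirror the "Quality Tiers" list in this
    file, not the repository's actual scoring code.
    """
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    if score >= 20:
        return "Poor"
    return "Very Poor"
```

A PoC scored at 72 points would fall in the "Good" tier under this reading.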
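
The `cves/YEAR/CVE-ID/` layout described under "File-Based Storage Structure" can be navigated with a few lines of standard-library Python. This is a minimal sketch, assuming the year in the directory path is the year embedded in the CVE ID and that rule variants follow the `rule_*.sigma` naming shown above; `cve_directory` and `list_rule_variants` are hypothetical names, not functions from `cli/commands/base_command.py`.

```python
import re
from pathlib import Path
from typing import List

# CVE IDs look like CVE-2024-0001: a four-digit year, then four or more digits.
CVE_ID_PATTERN = re.compile(r"CVE-(\d{4})-(\d{4,})")


def cve_directory(cve_id: str, root: Path = Path("cves")) -> Path:
    """Return the directory for a CVE, e.g. cves/2024/CVE-2024-0001.

    Hypothetical helper; assumes YEAR is the year embedded in the CVE ID.
    """
    normalized = cve_id.strip().upper()
    match = CVE_ID_PATTERN.fullmatch(normalized)
    if match is None:
        raise ValueError(f"not a valid CVE ID: {cve_id!r}")
    return root / match.group(1) / normalized


def list_rule_variants(cve_dir: Path) -> List[str]:
    """List the rule_*.sigma file names stored for one CVE, sorted by name."""
    return sorted(path.name for path in cve_dir.glob("rule_*.sigma"))
```

For example, `list_rule_variants(cve_directory("CVE-2024-0001"))` would return the rule file names present for that CVE, or an empty list if it has not been processed yet.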
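
The `LLM_PROVIDER` and `LLM_MODEL` variables from "Environment Variables (.env)" combine with the per-provider defaults from "Supported LLM Providers and Models" roughly as follows. This sketch is an assumption about how the settings could be resolved, not a copy of `llm_client.py`: the function name `resolve_llm_settings` is hypothetical, and returning `None` when no provider is set is an illustrative choice that lines up with the documented fallback to template-based generation.

```python
import os
from typing import Mapping, Optional, Tuple

# Default model per provider, as listed in this file.
DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "anthropic": "claude-3-5-sonnet-20241022",
    "ollama": "llama3.2",
}


def resolve_llm_settings(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (provider, model) from the environment, or None if unset.

    Hypothetical helper: None means no LLM provider was selected, so a
    caller could fall back to template-based rule generation.
    """
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if not provider:
        return None
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unsupported LLM_PROVIDER: {provider!r}")
    # An explicit LLM_MODEL wins; otherwise use the documented default.
    model = env.get("LLM_MODEL", "").strip() or DEFAULT_MODELS[provider]
    return provider, model
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test without touching the real environment.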