# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview - CLI-Based Architecture (v2.0)

This is an enhanced CVE-SIGMA Auto Generator that has been **converted from a web application into a command-line tool** with file-based SIGMA rule management. The system supports:

1. **Bulk NVD Data Processing**: Downloads and processes complete NVD JSON datasets (2002-2025)
2. **nomi-sec PoC Integration**: Uses curated PoC data from github.com/nomi-sec/PoC-in-GitHub
3. **Enhanced SIGMA Rule Generation**: Creates intelligent rules based on real exploit indicators
4. **Comprehensive Database Seeding**: Supports both bulk and incremental data updates

## Architecture - CLI-Based System

### **Current Primary Architecture (v2.0)**
- **CLI Interface**: Professional command-line tool (`cli/sigma_cli.py`) with modular commands
- **File-Based Storage**: Git-friendly YAML and JSON files organized by year/CVE-ID
- **Directory Structure**: 
  - `cves/YEAR/CVE-ID/`: Individual CVE directories with metadata and multiple rule variants
  - `cli/commands/`: Modular command system (process, generate, search, stats, export, migrate)
  - `reports/`: Generated statistics and export outputs
- **Data Processing**:
  - Reuses existing backend processors for CVE fetching and analysis
  - File-based rule generation with multiple variants per CVE
  - CLI-driven bulk operations and incremental updates
- **Storage Format**: 
  - `metadata.json`: CVE information, PoC data, processing history
  - `rule_*.sigma`: Multiple SIGMA rule variants (template, LLM, hybrid)
  - `poc_analysis.json`: Extracted exploit indicators and analysis
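
As an illustration of the layout above, the following minimal sketch walks `cves/YEAR/CVE-ID/` and collects each CVE's metadata and rule file names. The helper names `load_cve` and `iter_cves` are hypothetical and not part of the CLI, and nothing is assumed about the fields inside `metadata.json` beyond it being valid JSON.

```python
import json
from pathlib import Path


def load_cve(cve_dir: Path) -> dict:
    """Read one CVE directory: metadata.json plus any rule_*.sigma files."""
    metadata_path = cve_dir / "metadata.json"
    metadata = {}
    if metadata_path.exists():
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    # Rule variants are discovered by file name pattern only.
    rules = sorted(p.name for p in cve_dir.glob("rule_*.sigma"))
    return {"cve_id": cve_dir.name, "metadata": metadata, "rules": rules}


def iter_cves(root: Path, year: int):
    """Yield one record per CVE directory under root/YEAR/."""
    year_dir = root / str(year)
    if not year_dir.is_dir():
        return  # nothing processed for this year yet
    for cve_dir in sorted(year_dir.iterdir()):
        if cve_dir.is_dir():
            yield load_cve(cve_dir)


if __name__ == "__main__":
    for record in iter_cves(Path("cves"), 2024):
        print(record["cve_id"], record["rules"])
```

Because the storage is plain files, a script like this can inspect results without going through the CLI or a database.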

### **Database Components (For Migration Only)**
- **Database Models**: `backend/database_models.py` - SQLAlchemy models for data migration
- **Legacy Support**: Core data processors maintained for CLI integration
- **Migration Tools**: Complete CLI-based migration utilities from legacy database

## Common Development Commands

### **Docker Compose Setup (Recommended)**
```bash
# Quick start with Docker Compose
cp .env.example .env  # Edit with your API keys (optional)
docker-compose up -d  # Start all services (db, redis, CLI container)

# Access CLI in container
docker-compose exec sigma-cli bash

# Run CLI commands in container
docker-compose exec sigma-cli python cli/sigma_cli.py --help
docker-compose exec sigma-cli python cli/sigma_cli.py process year 2024

# Use Makefile shortcuts
make setup       # Initial setup
make up          # Start services
make shell       # Access CLI shell
make cli CMD="stats overview"  # Run specific CLI commands
```

### **Native CLI Installation (Alternative)**
```bash
# Install CLI dependencies
pip install -r backend/requirements.txt
pip install click rich tabulate pyyaml

# Make CLI executable
chmod +x cli/sigma_cli.py

# Initialize configuration
./cli/sigma_cli.py config-init

# Test CLI installation
./cli/sigma_cli.py --help
```

### **CLI Primary Operations**
```bash
# Process CVEs and generate SIGMA rules
./cli/sigma_cli.py process year 2024                    # Process specific year
./cli/sigma_cli.py process cve CVE-2024-0001            # Process specific CVE
./cli/sigma_cli.py process bulk --start-year 2020       # Bulk process years
./cli/sigma_cli.py process incremental --days 7         # Process recent changes

# Generate rules for existing CVEs
./cli/sigma_cli.py generate cve CVE-2024-0001 --method all
./cli/sigma_cli.py generate regenerate --year 2024 --method llm

# Search and analyze
./cli/sigma_cli.py search cve "buffer overflow" --severity critical --has-poc
./cli/sigma_cli.py search rules "powershell" --method llm

# Statistics and reports
./cli/sigma_cli.py stats overview --year 2024
./cli/sigma_cli.py stats poc --year 2024
./cli/sigma_cli.py stats rules --method template

# Export data
./cli/sigma_cli.py export sigma ./output-rules --format yaml --year 2024
./cli/sigma_cli.py export metadata ./reports/cve-data.csv --format csv
```

### **Migration from Web Application**
```bash
# Migrate existing database to file structure
./cli/sigma_cli.py migrate from-database --database-url "postgresql://user:pass@localhost:5432/db"

# Validate migrated data
./cli/sigma_cli.py migrate validate --year 2024

# Check migration statistics
./cli/sigma_cli.py stats overview
```

### **Database Migration Support**
```bash
# If you have an existing PostgreSQL database with CVE data
export DATABASE_URL="postgresql://user:pass@localhost:5432/cve_sigma_db"

# Migrate database to CLI file structure
./cli/sigma_cli.py migrate from-database --database-url "$DATABASE_URL"
```

### **Development and Testing**
```bash
# CLI with verbose logging
./cli/sigma_cli.py --verbose process year 2024

# Test individual commands
./cli/sigma_cli.py version
./cli/sigma_cli.py config-init
./cli/sigma_cli.py stats overview

# Check file structure
ls -la cves/2024/                      # View processed CVEs
ls -la cves/2024/CVE-2024-0001/        # View individual CVE files
```

## Key Configuration

### Environment Variables (.env)
- `NVD_API_KEY`: Optional NVD API key for higher rate limits (5→50 requests/30s)
- `GITHUB_TOKEN`: Optional GitHub token for exploit analysis (enhances rule generation)
- `OPENAI_API_KEY`: Optional OpenAI API key for AI-enhanced SIGMA rule generation
- `ANTHROPIC_API_KEY`: Optional Anthropic API key for AI-enhanced SIGMA rule generation
- `OLLAMA_BASE_URL`: Optional Ollama base URL for local model AI-enhanced SIGMA rule generation
- `LLM_PROVIDER`: Optional LLM provider selection (openai, anthropic, ollama)
- `LLM_MODEL`: Optional LLM model selection (provider-specific)
- `DATABASE_URL`: PostgreSQL connection string (needed only when migrating from the legacy database)
- `REACT_APP_API_URL`: Backend API URL for frontend

### CLI Configuration
- **Configuration File**: `~/.sigma-cli/config.yaml` (auto-created with `config-init`)
- **Directory Structure**: 
  - `cves/YEAR/CVE-ID/`: Individual CVE data and rules
  - `reports/`: Generated statistics and exports
  - `cli/`: Command-line tool and modules

### Database Connection (For Migration Only)
- **PostgreSQL**: localhost:5432 (if migrating from legacy database)
- **Connection String**: Set via DATABASE_URL environment variable

### Enhanced API Endpoints

#### Bulk Processing
- `POST /api/bulk-seed` - Start complete bulk seeding (NVD + nomi-sec)
- `POST /api/incremental-update` - Update with NVD modified/recent feeds
- `POST /api/sync-nomi-sec` - Synchronize nomi-sec PoC data
- `POST /api/regenerate-rules` - Regenerate SIGMA rules with enhanced data
- `GET /api/bulk-jobs` - Get bulk processing job status
- `GET /api/bulk-status` - Get comprehensive system status
- `GET /api/poc-stats` - Get PoC-related statistics

#### Enhanced Data Access
- `GET /api/stats` - Enhanced statistics with PoC coverage
- `GET /api/claude-status` - Get Claude API availability status
- All existing CVE and SIGMA rule endpoints now include enhanced data fields

#### LLM-Enhanced Rule Generation
- `POST /api/llm-enhanced-rules` - Generate SIGMA rules using LLM AI analysis (supports multiple providers)
- `GET /api/llm-status` - Check LLM API availability and configuration for all providers
- `POST /api/llm-switch` - Switch between LLM providers and models

## Code Architecture Details

### **CLI Structure (Primary)**
- **cli/sigma_cli.py**: Main executable CLI with Click framework
- **cli/commands/**: Modular command system
  - `base_command.py`: Common functionality and file operations
  - `process_commands.py`: CVE processing and bulk operations
  - `generate_commands.py`: SIGMA rule generation
  - `search_commands.py`: Search and filtering
  - `stats_commands.py`: Statistics and reporting
  - `export_commands.py`: Data export in multiple formats
  - `migrate_commands.py`: Database migration tools
- **cli/config/**: Configuration management
- **cli/README.md**: Detailed CLI documentation

### **File-Based Storage Structure**
- **CVE Directories**: `cves/YEAR/CVE-ID/` with individual metadata and rule files
- **Rule Variants**: Multiple SIGMA files per CVE (template, LLM, hybrid)
- **Metadata Format**: JSON files with processing history and PoC data
- **Reports**: Generated statistics and export outputs

### **Backend Data Processors (Reused by CLI)**
- **database_models.py**: SQLAlchemy models for data migration
- **Data Processors**: Core processing logic reused by CLI
  - `nvd_bulk_processor.py`: NVD JSON dataset processing
  - `nomi_sec_client.py`: nomi-sec PoC integration
  - `enhanced_sigma_generator.py`: SIGMA rule generation
  - `llm_client.py`: Multi-provider LLM integration
  - `poc_analyzer.py`: PoC content analysis

### **CLI-Based Data Processing Flow**
1. **CVE Processing**: NVD data fetch → File storage → PoC analysis → Metadata generation
2. **Rule Generation**: Template/LLM/Hybrid generation → Multiple rule variants → File storage
3. **Search & Analysis**: File-based searching → Statistics generation → Export capabilities
4. **Migration Support**: Database export → File conversion → Validation → Cleanup

### **Legacy Web Processing Flow (For Reference)**
1. **Bulk Seeding**: NVD JSON downloads → Database storage → nomi-sec PoC sync → Enhanced rule generation
2. **Incremental Updates**: NVD modified feeds → Update existing data → Sync new PoCs
3. **Rule Enhancement**: PoC analysis → Indicator extraction → Template selection → Enhanced SIGMA rule
4. **LLM-Enhanced Generation**: PoC content analysis → Multi-provider LLM processing → Advanced SIGMA rule creation

## Development Notes

### Enhanced Rule Generation Logic
Rule generation proceeds through the following steps:
1. **CVE Analysis**: Extract metadata from NVD bulk data
2. **PoC Quality Assessment**: nomi-sec PoC analysis with star count, recency, quality tiers
3. **Advanced Indicator Extraction**: Processes, files, network, registry, commands from PoC repositories
4. **Template Selection**: Smart template matching based on PoC indicators and CVE characteristics
5. **Enhanced Rule Population**: Incorporate real exploit indicators with quality scoring
6. **MITRE ATT&CK Mapping**: Automatic technique identification based on indicators
7. **LLM AI Enhancement**: Optional multi-provider LLM integration for intelligent rule generation from PoC code analysis

### Quality Tiers
- **Excellent** (80+ points): High star count, recent updates, detailed descriptions
- **Good** (60-79 points): Moderate quality indicators
- **Fair** (40-59 points): Basic PoC with some quality indicators
- **Poor** (20-39 points): Minimal quality indicators
- **Very Poor** (<20 points): Low-quality PoCs
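
The tier boundaries above can be expressed as a small lookup. This is a sketch only: the function name `quality_tier` is hypothetical, and it assumes the ranges are half-open (a fractional score such as 79.5 counts as Good), which the list does not specify.

```python
def quality_tier(score: float) -> str:
    """Map a PoC quality score to the tier names listed above."""
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    if score >= 20:
        return "Poor"
    return "Very Poor"


if __name__ == "__main__":
    for sample in (85, 60, 45, 20, 5):
        print(sample, quality_tier(sample))
```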

### Multi-Provider LLM Integration Features
- **Multiple LLM Providers**: Support for OpenAI, Anthropic, and Ollama (local models)
- **Dynamic Provider Switching**: Switch between providers and models through the UI or API
- **Intelligent Code Analysis**: LLMs analyze actual exploit code from PoC repositories
- **Advanced Rule Generation**: Creates sophisticated SIGMA rules with proper syntax and logic
- **Contextual Understanding**: Interprets CVE descriptions and maps them to appropriate detection patterns
- **Automatic Validation**: Generated rules are validated for SIGMA syntax compliance
- **Fallback Mechanism**: Automatically falls back to template-based generation if LLM is unavailable
- **Enhanced Metadata**: Rules include generation method tracking for quality assessment
- **LangChain Integration**: Uses LangChain for robust LLM integration and prompt management
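
The fallback behaviour and generation-method tracking described above might look roughly like the sketch below. It is not the project's implementation: `generate_with_fallback` and the `RuleGenerator` callable type are hypothetical stand-ins for whatever `llm_client.py` and `enhanced_sigma_generator.py` actually expose, and it treats any exception or empty result from the LLM as "unavailable".

```python
from typing import Callable, Optional, Tuple

# Hypothetical signature: takes a CVE ID, returns SIGMA rule text.
RuleGenerator = Callable[[str], str]


def generate_with_fallback(
    cve_id: str,
    template_generate: RuleGenerator,
    llm_generate: Optional[RuleGenerator] = None,
) -> Tuple[str, str]:
    """Return (rule_text, method), preferring the LLM and falling back to the template."""
    if llm_generate is not None:
        try:
            rule = llm_generate(cve_id)
            if rule and rule.strip():
                return rule, "llm"
        except Exception:
            # Any provider failure (missing key, timeout, bad response)
            # falls through to template-based generation.
            pass
    return template_generate(cve_id), "template"
```

Returning the method alongside the rule text lets the caller record which path produced each rule.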

### Supported LLM Providers and Models

#### OpenAI
- **API Key**: Set `OPENAI_API_KEY` environment variable
- **Supported Models**: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
- **Default Model**: gpt-4o-mini
- **Rate Limits**: Based on OpenAI API limits

#### Anthropic
- **API Key**: Set `ANTHROPIC_API_KEY` environment variable
- **Supported Models**: claude-3-5-sonnet-20241022, claude-3-haiku-20240307, claude-3-opus-20240229
- **Default Model**: claude-3-5-sonnet-20241022
- **Rate Limits**: Based on Anthropic API limits

#### Ollama (Local Models)
- **Setup**: Install Ollama locally and set `OLLAMA_BASE_URL` (default: http://localhost:11434)
- **Supported Models**: llama3.2, codellama, mistral, llama2 (any Ollama-compatible model)
- **Default Model**: llama3.2
- **Rate Limits**: No external API limits (local processing)
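
To show how the environment variables and per-provider defaults above fit together, here is a hedged sketch of resolving the active provider and model. The function `resolve_llm_config` and the shape of the returned dictionary are assumptions for illustration. Because no default provider is named in this document, the sketch returns `None` when `LLM_PROVIDER` is unset rather than guessing one.

```python
import os
from typing import Mapping, Optional

# Default models as listed in this section.
DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "anthropic": "claude-3-5-sonnet-20241022",
    "ollama": "llama3.2",
}
API_KEY_VARS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def resolve_llm_config(env: Optional[Mapping[str, str]] = None) -> Optional[dict]:
    """Build a provider/model config from environment variables."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if not provider:
        return None  # no provider selected (assumption: caller uses templates)
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    config = {
        "provider": provider,
        "model": env.get("LLM_MODEL") or DEFAULT_MODELS[provider],
    }
    if provider == "ollama":
        config["base_url"] = env.get("OLLAMA_BASE_URL") or "http://localhost:11434"
    else:
        config["api_key"] = env.get(API_KEY_VARS[provider])
    return config
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise in tests.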

### Testing and Validation
- **Frontend tests**: `npm test` (in frontend directory)
- **Backend testing**: Use standalone scripts for bulk operations
- **API testing**: Use `/docs` endpoint for Swagger UI
- **Task Monitoring**: Monitor via Flower dashboard at http://localhost:5555
- **Celery Tasks**: Use `celery -A celery_config worker --loglevel=info` for debugging

### Security Considerations
- **API Keys**: Store the NVD API key, GitHub token, and LLM provider API keys in environment variables
- **PoC Analysis**: Automated analysis of curated PoC repositories (safer than raw GitHub search)
- **Rate Limiting**: Built-in rate limiting for external APIs
- **Data Validation**: Enhanced validation for bulk data processing
- **Audit Trail**: Job tracking for all bulk operations

## Troubleshooting

### Common Issues
- **Bulk Processing Failures**: Check `/api/bulk-jobs` for detailed error messages
- **NVD Data Download Issues**: Verify NVD API key and network connectivity
- **nomi-sec API Timeouts**: Retry logic is built in; if timeouts persist, check network connectivity
- **Frontend build errors**: Run `npm install` in the frontend directory
- **Database schema changes**: Restart the backend to auto-create new tables
- **Memory issues during bulk processing**: Monitor system resources, consider smaller batch sizes

### Enhanced Rate Limits
- **NVD API**: 5 requests/30s (no key) → 50 requests/30s (with key)
- **nomi-sec API**: 1 request/second (built-in rate limiting)
- **GitHub API** (fallback): 60 requests/hour (no token) → 5000 requests/hour (with token)
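
One simple way to stay within limits like these is to space requests evenly, as in the sketch below. This is an illustrative technique, not the project's built-in limiter: the class `MinIntervalLimiter` and the helper `nvd_limiter` are hypothetical. Even spacing is also stricter than a sliding window, since it never allows a burst of 5 requests inside the 30 seconds.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum gap between calls: per_seconds / max_requests."""

    def __init__(self, max_requests, per_seconds, clock=time.monotonic, sleep=time.sleep):
        self.interval = per_seconds / max_requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may proceed

    def wait(self) -> float:
        """Block until the next request is allowed; return the delay applied."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.interval
        return delay


def nvd_limiter(api_key=None) -> MinIntervalLimiter:
    """NVD figures from the list above: 5 per 30 s without a key, 50 per 30 s with one."""
    return MinIntervalLimiter(50 if api_key else 5, 30.0)
```

Calling `limiter.wait()` before each request is enough; the `clock` and `sleep` parameters exist so the limiter can be tested without real delays.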

### Performance Optimization
- **Bulk Processing**: Start with recent years (2020+) for faster initial setup
- **PoC Sync**: Use smaller batch sizes (50) for better stability
- **Rule Generation**: Monitor quality scores to prioritize high-value PoCs
- **Database**: Ensure proper indexing on CVE ID and PoC fields
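
The batch-size advice above can be applied with a small chunking helper. The sketch below is generic Python. The name `batched` and the default of 50 (taken from the PoC sync recommendation) are illustrative, and how the CLI itself batches work is not shown here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int = 50) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Hypothetical CVE IDs, used only to show the batch sizes.
    cve_ids = [f"CVE-2024-{n:04d}" for n in range(1, 121)]
    for batch in batched(cve_ids):
        print(len(batch), batch[0], batch[-1])
```

Smaller batches keep memory use bounded and make a failed batch cheaper to retry.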

### Monitoring
- **Frontend**: Use Bulk Jobs tab for real-time progress monitoring
- **Backend logs**: `docker-compose logs -f backend`
- **Job status**: Check `/api/bulk-status` for comprehensive system health
- **Database**: Monitor PoC coverage percentage and rule enhancement progress