# Local LLM Training Platform

A Docker-based platform for local training and fine-tuning of large language models with GPU acceleration.

## Architecture

This platform provides a complete environment for LLM development with the following services (a compose sketch follows the list):

- **Training**: GPU-enabled container for model training and fine-tuning
- **Serving**: API server for model inference using trained models
- **Jupyter**: Development environment for experimentation and analysis
- **Monitoring**: Prometheus-based metrics collection
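
The service layout in `docker-compose.yml` roughly follows the shape below. The build contexts and image are assumptions based on the directory structure; the ports match the ones used later in this README. Treat this as a sketch and consult the actual compose file.

```yaml
# Illustrative sketch only; the docker-compose.yml in the repo root is the
# source of truth. Build contexts and the Prometheus image are assumptions.
services:
  training:
    build: docker/training    # GPU-enabled training environment
  serving:
    build: docker/serving     # inference API on port 8000
    ports:
      - "8000:8000"
  jupyter:
    build: docker/jupyter     # Jupyter Lab on port 8888
    ports:
      - "8888:8888"
  monitoring:
    image: prom/prometheus    # Prometheus metrics on port 9090
    ports:
      - "9090:9090"
```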

## Prerequisites

- Docker and Docker Compose
- NVIDIA Docker runtime for GPU support (verify with the check below)
- GPU with sufficient VRAM for your target model size
- Adequate disk space for models and datasets
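
Before starting the stack, a quick way to confirm that containers can reach the GPU is to run `nvidia-smi` inside any CUDA base image (the image tag below is only an example):

```bash
# Should print the nvidia-smi table from inside a container.
# Any CUDA base image you already have locally works equally well.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```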

## Quick Start

1. Start all services:

   ```bash
   docker-compose up -d
   ```

2. Access Jupyter Lab:

   - Open http://localhost:8888 in your browser
   - Use the notebooks in `/notebooks` for experimentation

3. Start training:

   ```bash
   docker-compose exec training python scripts/train.py --config configs/training_config.yaml
   ```

4. Serve models:

   - The serving API will be available at http://localhost:8000
   - Health check endpoint: http://localhost:8000/health (see the example request below)

5. Monitor resources:

   - Prometheus metrics at http://localhost:9090
   - Check GPU usage:

     ```bash
     docker-compose exec training nvidia-smi
     ```
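
Once the serving container is up, a simple request confirms the API is healthy; the exact response payload depends on the serving implementation:

```bash
# Expect an HTTP 200 and a small status payload if the API is up.
curl http://localhost:8000/health
```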

## Directory Structure

```
├── checkpoints/   # Model checkpoints and saved states
├── configs/       # Training and model configurations
├── data/          # Training datasets
├── docker/        # Docker configurations for each service
│   ├── jupyter/   # Jupyter Lab environment
│   ├── serving/   # Model serving API
│   └── training/  # Training environment
├── experiments/   # Experiment tracking and results
├── models/        # Trained model artifacts
├── monitoring/    # Prometheus configuration
├── notebooks/     # Jupyter notebooks for development
└── scripts/       # Training and utility scripts
```

## Service Management

### Individual Services

Start specific services:

```bash
# Training only
docker-compose up training

# Serving only
docker-compose up serving

# Jupyter only
docker-compose up jupyter
```

### Logs and Debugging

View service logs:

```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f training
```

Access container shells:

```bash
# Training environment
docker-compose exec training bash

# Serving environment
docker-compose exec serving bash
```

### Cleanup

Stop all services:

```bash
docker-compose down
```

Rebuild containers (after code changes):

```bash
docker-compose build --no-cache
docker-compose up -d
```

## Configuration

- **Training Config**: Edit `configs/training_config.yaml` for model and training parameters (a sketch follows below)
- **GPU Settings**: Modify `CUDA_VISIBLE_DEVICES` in `docker-compose.yml` for multi-GPU setups
- **Port Mappings**: Adjust port mappings in `docker-compose.yml` if needed
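
The exact schema of `configs/training_config.yaml` is defined by `scripts/train.py`; the field names below are illustrative assumptions rather than the repository's actual keys:

```yaml
# Hypothetical training config; check scripts/train.py for the real schema.
model:
  base_model: ./models/base        # path or hub ID of the model to fine-tune
training:
  epochs: 3
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 2.0e-5
data:
  train_file: data/train.jsonl     # dataset placed under data/
output:
  checkpoint_dir: checkpoints/run-001
```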

## Data Management

- Place datasets in the `data/` directory
- Model files are stored in `models/`
- Training checkpoints are saved to `checkpoints/`
- All directories are mounted as persistent volumes (see the excerpt below)
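
The mounts typically look like the following in `docker-compose.yml`; the container-side paths here are assumptions and may differ from the actual file:

```yaml
# Illustrative volume mounts for the training service; verify against
# the repository's docker-compose.yml (container paths are assumed).
services:
  training:
    volumes:
      - ./data:/workspace/data
      - ./models:/workspace/models
      - ./checkpoints:/workspace/checkpoints
      - ./configs:/workspace/configs
```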

## Security Notes

- Never commit API keys, tokens, or sensitive data
- Use environment variables for sensitive configuration (example below)
- Consider data privacy when working with training datasets
- Implement proper access controls for production deployments
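
Sensitive values belong in a local `.env` file, which Docker Compose reads automatically and which should stay out of version control (copy `.env.example` as a starting point). The variable names below are only examples, not keys required by this repo:

```bash
# Example .env entries; names are illustrative.
# Keep .env listed in .gitignore.
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
WANDB_API_KEY=xxxxxxxxxxxxxxxx
```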

## Development Workflow

- Use Jupyter Lab for data exploration and prototyping
- Configure training parameters in `configs/training_config.yaml`
- Run training jobs in the training container
- Monitor progress using Prometheus metrics
- Deploy trained models using the serving container
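
Condensed into commands already shown in the sections above, one pass through that loop looks like this:

```bash
docker-compose up -d jupyter training        # start the dev and training environments
#   ...prototype in Jupyter Lab at http://localhost:8888, then edit configs/training_config.yaml...
docker-compose exec training python scripts/train.py --config configs/training_config.yaml
#   ...watch metrics at http://localhost:9090 while training runs...
docker-compose up -d serving                 # bring up the inference API
curl http://localhost:8000/health            # confirm it is serving
```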

## Troubleshooting

- **GPU not detected**: Ensure the NVIDIA Docker runtime is installed and the compose file requests the GPU (see the stanza below)
- **Out of memory**: Reduce batch size or model size in the training config
- **Permission issues**: Check file permissions in mounted volumes
- **Port conflicts**: Modify port mappings in `docker-compose.yml`
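
For the "GPU not detected" case, the training service needs a GPU reservation (or `runtime: nvidia`) in `docker-compose.yml`. The stanza below is the standard Compose form and may differ from this repository's exact setup:

```yaml
# Standard Docker Compose GPU reservation; verify against this repo's docker-compose.yml.
services:
  training:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```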