# Local LLM Training Platform

A Docker-based platform for local training and fine-tuning of large language models with GPU acceleration.

## Architecture

This platform provides a complete environment for LLM development with the following services (a compose sketch follows the list):

- **Training**: GPU-enabled container for model training and fine-tuning
- **Serving**: API server for model inference using trained models
- **Jupyter**: Development environment for experimentation and analysis
- **Monitoring**: Prometheus-based metrics collection
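
The service layout in `docker-compose.yml` roughly follows the shape below. The build contexts and image are assumptions based on the directory structure; the ports match the ones used later in this README. Treat this as a sketch and consult the actual compose file.

```yaml
# Illustrative sketch only; the docker-compose.yml in the repo root is the
# source of truth. Build contexts and the Prometheus image are assumptions.
services:
  training:
    build: docker/training    # GPU-enabled training environment
  serving:
    build: docker/serving     # inference API on port 8000
    ports:
      - "8000:8000"
  jupyter:
    build: docker/jupyter     # Jupyter Lab on port 8888
    ports:
      - "8888:8888"
  monitoring:
    image: prom/prometheus    # Prometheus metrics on port 9090
    ports:
      - "9090:9090"
```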

## Prerequisites

- Docker and Docker Compose
- NVIDIA Docker runtime for GPU support (verify with the check below)
- GPU with sufficient VRAM for your target model size
- Adequate disk space for models and datasets
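
Before starting the stack, a quick way to confirm that containers can reach the GPU is to run `nvidia-smi` inside any CUDA base image (the image tag below is only an example):

```bash
# Should print the nvidia-smi table from inside a container.
# Any CUDA base image you already have locally works equally well.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```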

## Quick Start

1. Start all services:

   ```bash
   docker-compose up -d
   ```

2. Access Jupyter Lab:

   - Open http://localhost:8888 in your browser
   - Use the notebooks in `/notebooks` for experimentation

3. Start training:

   ```bash
   docker-compose exec training python scripts/train.py --config configs/training_config.yaml
   ```

4. Serve models:

   - The serving API will be available at http://localhost:8000
   - Health check endpoint: http://localhost:8000/health (see the example request below)

5. Monitor resources:

   - Prometheus metrics at http://localhost:9090
   - Check GPU usage:

     ```bash
     docker-compose exec training nvidia-smi
     ```
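
Once the serving container is up, a simple request confirms the API is healthy; the exact response payload depends on the serving implementation:

```bash
# Expect an HTTP 200 and a small status payload if the API is up.
curl http://localhost:8000/health
```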

## Directory Structure

```
├── checkpoints/   # Model checkpoints and saved states
├── configs/       # Training and model configurations
├── data/          # Training datasets
├── docker/        # Docker configurations for each service
│   ├── jupyter/   # Jupyter Lab environment
│   ├── serving/   # Model serving API
│   └── training/  # Training environment
├── experiments/   # Experiment tracking and results
├── models/        # Trained model artifacts
├── monitoring/    # Prometheus configuration
├── notebooks/     # Jupyter notebooks for development
└── scripts/       # Training and utility scripts
```

## Service Management

### Individual Services

Start specific services:

```bash
# Training only
docker-compose up training

# Serving only
docker-compose up serving

# Jupyter only
docker-compose up jupyter
```

### Logs and Debugging

View service logs:

```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f training
```

Access container shells:

```bash
# Training environment
docker-compose exec training bash

# Serving environment
docker-compose exec serving bash
```

### Cleanup

Stop all services:

```bash
docker-compose down
```

Rebuild containers (after code changes):

```bash
docker-compose build --no-cache
docker-compose up -d
```

## Configuration

- **Training Config**: Edit `configs/training_config.yaml` for model and training parameters (a sketch follows below)
- **GPU Settings**: Modify `CUDA_VISIBLE_DEVICES` in `docker-compose.yml` for multi-GPU setups
- **Port Mappings**: Adjust port mappings in `docker-compose.yml` if needed
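
The exact schema of `configs/training_config.yaml` is defined by `scripts/train.py`; the field names below are illustrative assumptions rather than the repository's actual keys:

```yaml
# Hypothetical training config; check scripts/train.py for the real schema.
model:
  base_model: ./models/base        # path or hub ID of the model to fine-tune
training:
  epochs: 3
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 2.0e-5
data:
  train_file: data/train.jsonl     # dataset placed under data/
output:
  checkpoint_dir: checkpoints/run-001
```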

## Data Management

- Place datasets in the `data/` directory
- Model files are stored in `models/`
- Training checkpoints are saved to `checkpoints/`
- All directories are mounted as persistent volumes (see the excerpt below)
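
The mounts typically look like the following in `docker-compose.yml`; the container-side paths here are assumptions and may differ from the actual file:

```yaml
# Illustrative volume mounts for the training service; verify against
# the repository's docker-compose.yml (container paths are assumed).
services:
  training:
    volumes:
      - ./data:/workspace/data
      - ./models:/workspace/models
      - ./checkpoints:/workspace/checkpoints
      - ./configs:/workspace/configs
```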

## Security Notes

- Never commit API keys, tokens, or sensitive data
- Use environment variables for sensitive configuration (example below)
- Consider data privacy when working with training datasets
- Implement proper access controls for production deployments
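
Sensitive values belong in a local `.env` file, which Docker Compose reads automatically and which should stay out of version control (copy `.env.example` as a starting point). The variable names below are only examples, not keys required by this repo:

```bash
# Example .env entries; names are illustrative.
# Keep .env listed in .gitignore.
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
WANDB_API_KEY=xxxxxxxxxxxxxxxx
```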

## Development Workflow

- Use Jupyter Lab for data exploration and prototyping
- Configure training parameters in `configs/training_config.yaml`
- Run training jobs in the training container
- Monitor progress using Prometheus metrics
- Deploy trained models using the serving container
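
Condensed into commands already shown in the sections above, one pass through that loop looks like this:

```bash
docker-compose up -d jupyter training        # start the dev and training environments
#   ...prototype in Jupyter Lab at http://localhost:8888, then edit configs/training_config.yaml...
docker-compose exec training python scripts/train.py --config configs/training_config.yaml
#   ...watch metrics at http://localhost:9090 while training runs...
docker-compose up -d serving                 # bring up the inference API
curl http://localhost:8000/health            # confirm it is serving
```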

## Troubleshooting

- **GPU not detected**: Ensure the NVIDIA Docker runtime is installed and the compose file requests the GPU (see the stanza below)
- **Out of memory**: Reduce batch size or model size in the training config
- **Permission issues**: Check file permissions in mounted volumes
- **Port conflicts**: Modify port mappings in `docker-compose.yml`
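
For the "GPU not detected" case, the training service needs a GPU reservation (or `runtime: nvidia`) in `docker-compose.yml`. The stanza below is the standard Compose form and may differ from this repository's exact setup:

```yaml
# Standard Docker Compose GPU reservation; verify against this repo's docker-compose.yml.
services:
  training:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```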