
Local LLM Training Platform

A Docker-based platform for local training and fine-tuning of large language models with GPU acceleration.

Architecture

This platform provides a complete environment for LLM development with the following services:

  • Training: GPU-enabled container for model training and fine-tuning
  • Serving: API server for model inference using trained models
  • Jupyter: Development environment for experimentation and analysis
  • Monitoring: Prometheus-based metrics collection and monitoring
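
The service names used in the commands below (training, serving, jupyter) come from docker-compose.yml; to confirm what is defined in your checkout, including the name of the monitoring service:

# List the services defined in docker-compose.yml
docker-compose config --services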

Prerequisites

  • Docker and Docker Compose
  • NVIDIA Docker runtime for GPU support
  • GPU with sufficient VRAM for your target model size
  • Adequate disk space for models and datasets
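
A quick way to confirm the GPU prerequisites are in place before starting the stack (the CUDA image tag below is only an example; any CUDA base image works):

# Host driver sees the GPU
nvidia-smi

# Containers can reach the GPU through the NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi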

Quick Start

  1. Start all services:

    docker-compose up -d
    
  2. Access Jupyter Lab:
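
     The host port comes from the jupyter service's port mapping in docker-compose.yml; assuming the usual Jupyter port of 8888:

    # The startup log prints the access URL, including the login token
    docker-compose logs jupyter
    # then open http://localhost:8888 in a browser (adjust if your mapping differs)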

  3. Start training:

    docker-compose exec training python scripts/train.py --config configs/training_config.yaml
    
  4. Serve models:
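
     The serving container provides the inference API for trained models:

    docker-compose up -d serving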

  5. Monitor resources:
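
     Prometheus collects the metrics; assuming its default port of 9090 is mapped in docker-compose.yml:

    # Prometheus web UI (adjust the port if your mapping differs)
    # open http://localhost:9090

    # Per-container CPU and memory usage on the host
    docker stats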

Directory Structure

├── checkpoints/          # Model checkpoints and saved states
├── configs/             # Training and model configurations
├── data/               # Training datasets
├── docker/             # Docker configurations for each service
│   ├── jupyter/        # Jupyter Lab environment
│   ├── serving/        # Model serving API
│   └── training/       # Training environment
├── experiments/        # Experiment tracking and results
├── models/            # Trained model artifacts
├── monitoring/        # Prometheus configuration
├── notebooks/         # Jupyter notebooks for development
└── scripts/          # Training and utility scripts

Service Management

Individual Services

Start specific services:

# Training only
docker-compose up training

# Serving only
docker-compose up serving

# Jupyter only
docker-compose up jupyter

Logs and Debugging

View service logs:

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f training

Access container shells:

# Training environment
docker-compose exec training bash

# Serving environment
docker-compose exec serving bash

Cleanup

Stop all services:

docker-compose down

Rebuild containers (after code changes):

docker-compose build --no-cache
docker-compose up -d

Configuration

  • Training Config: Edit configs/training_config.yaml for model and training parameters
  • GPU Settings: Modify CUDA_VISIBLE_DEVICES in docker-compose.yml for multi-GPU setups
  • Port Mappings: Adjust port mappings in docker-compose.yml if needed
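
To see which host ports the running services currently use before adjusting the mappings:

# Shows each service's state and its host->container port mappings
docker-compose ps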

Data Management

  • Place datasets in the data/ directory
  • Model files are stored in models/
  • Training checkpoints are saved to checkpoints/
  • All directories are mounted as persistent volumes
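
For example, a file copied into data/ on the host becomes visible to the training container through the mounted volume (the container-side path depends on the volume definition in docker-compose.yml):

# my_dataset.jsonl is a placeholder name
cp my_dataset.jsonl data/

# Check the mount from inside the container; the exact path may differ
docker-compose exec training ls data/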

Security Notes

  • Never commit API keys, tokens, or sensitive data
  • Use environment variables for sensitive configuration
  • Consider data privacy when working with training datasets
  • Implement proper access controls for production deployments
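
For example, keep secrets in a local .env file created from the committed template; docker-compose reads .env automatically for variable substitution, and the file itself should stay untracked:

# Create a local environment file from the template, then add your keys/tokens
cp .env.example .env
# Confirm .env is covered by .gitignore before committing anything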

Development Workflow

  1. Use Jupyter Lab for data exploration and prototyping
  2. Configure training parameters in configs/training_config.yaml
  3. Run training jobs in the training container
  4. Monitor progress using Prometheus metrics
  5. Deploy trained models using the serving container

Troubleshooting

  • GPU not detected: Ensure NVIDIA Docker runtime is installed
  • Out of memory: Reduce batch size or model size in training config
  • Permission issues: Check file permissions in mounted volumes
  • Port conflicts: Modify port mappings in docker-compose.yml
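
Two quick checks that help narrow these down:

# "GPU not detected": confirm the nvidia runtime is registered with Docker
docker info | grep -i runtime

# "Out of memory": watch the training logs for CUDA OOM errors
docker-compose logs -f training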