# How to Self-Host a Model
Self-hosting AI models gives you complete control over your data and lets you use AI assistance offline. This guide covers several ways to use self-hosted models in MyDeskBot.
## Why Self-Host Models?
### Privacy and Security
- Data Control: Keep sensitive code and data locally
- No External API Calls: Eliminate external data transmission
- Compliance: Meet regulatory requirements for data processing
### Cost Management
- Eliminate API Costs: No per-request billing
- Predictable Expenses: Fixed infrastructure costs
- Scalability: Scale according to demand
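As a sanity check on the cost argument, a quick break-even estimate can tell you whether self-hosting pays off for your usage. The dollar figures below are purely illustrative assumptions, not pricing data:

```python
def breakeven_months(hardware_cost: float,
                     monthly_api_spend: float,
                     monthly_running_cost: float) -> float:
    """Months until a one-time hardware purchase beats ongoing API billing."""
    monthly_saving = monthly_api_spend - monthly_running_cost
    if monthly_saving <= 0:
        return float("inf")  # self-hosting never pays for itself
    return hardware_cost / monthly_saving

# A $2000 GPU workstation vs $250/month of API usage and $50/month of power:
print(breakeven_months(2000, 250, 50))  # → 10.0
```

If your monthly API spend is small, the break-even point can be years out, which is worth knowing before buying hardware.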
### Performance Advantages
- Low Latency: Direct access to models
- Custom Hardware: Optimize for specific hardware
- Priority Access: No waiting for shared resources
## Self-Hosting Options
### 1. Ollama (Recommended for Beginners)
The easiest way to get started with local models:
```shell
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run a model
ollama pull llama3:8b
ollama run llama3:8b
```

Configure in MyDeskBot:
```yaml
models:
  - name: "local-ollama"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://localhost:11434"
```

### 2. Hugging Face Transformers
Run models using the transformers library:
```shell
# Install dependencies
pip install transformers torch accelerate

# Run a model server
transformers-cli serve --model-id meta-llama/Meta-Llama-3-8B
```

### 3. Text Generation WebUI
Web-based interface for running models:
```shell
# Clone the repository
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui

# Install dependencies
pip install -r requirements.txt

# Run the web UI
python server.py --model llama-3-8b
```

## Hardware Requirements
### CPU-Only Setup
Minimum requirements:
- Memory: 16GB (32GB recommended)
- Storage: 50GB free space
- CPU: Modern multi-core processor
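A rough way to check these numbers against a specific model: memory use is approximately parameter count times bytes per parameter, plus runtime overhead. The sketch below uses an assumed 1.2x overhead factor (KV cache, runtime buffers) and is only a heuristic:

```python
def estimate_ram_gb(params_billion: float,
                    bits_per_param: int = 4,
                    overhead: float = 1.2) -> float:
    """Rough RAM estimate for running a quantized model locally.

    overhead is an assumed factor covering the KV cache and runtime buffers.
    """
    total_bytes = params_billion * 1e9 * (bits_per_param / 8) * overhead
    return total_bytes / (1024 ** 3)

# An 8B model at 4-bit quantization fits comfortably in 16GB of RAM:
print(round(estimate_ram_gb(8), 1))  # → 4.5
```

The same arithmetic explains why a 70B model needs a workstation-class machine even when quantized.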
Example configuration:
```yaml
models:
  - name: "cpu-model"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://localhost:11434"
    options:
      num_thread: 8  # Limit CPU threads
```

### GPU-Accelerated Setup
Recommended for better performance:
- NVIDIA GPU: 12GB+ VRAM (RTX 3080 or higher)
- AMD GPU: 12GB+ VRAM (RX 6800 or higher)
- Apple Silicon: M1/M2, 16GB+ unified memory
NVIDIA setup:
```shell
# Install the NVIDIA drivers and CUDA toolkit,
# then install Ollama (it uses CUDA automatically).

# Or use text-generation-webui with CUDA:
python server.py --model llama-3-8b --cuda
```

### Multi-GPU Setup
For enterprise-level deployment:
```yaml
models:
  - name: "multi-gpu-model"
    provider: "textgen"
    model: "llama-3-70b"
    baseURL: "http://localhost:5000"
    options:
      gpu_split: "20,20"  # Distribute across 2 GPUs
```

## Model Selection
### Small Models
Suitable for basic tasks:
- Mistral 7B: 4.1GB, good balance of size and capability
- Phi-3 3.8B: 2.4GB, Microsoft's compact model
- Gemma 2B: 1.6GB, Google's lightweight model
```shell
# Download small models
ollama pull mistral:7b
ollama pull phi3:3.8b
ollama pull gemma:2b
```

### Medium Models
Suitable for most development tasks:
- Llama 3 8B: 4.7GB, versatile and capable
- CodeLlama 7B: 4.1GB, coding-optimized
- Mixtral 8x7B: ~26GB, Mixture-of-Experts model
```shell
# Download medium models
ollama pull llama3:8b
ollama pull codellama:7b
ollama pull mixtral:8x7b
```

### Large Models (30GB+)
For complex tasks requiring maximum capability:
- Llama 3 70B: 40GB, state-of-the-art performance
- Mixtral 8x22B: ~80GB, powerful MoE model
```shell
# Download large models (requires significant resources)
ollama pull llama3:70b
```

## Deployment Strategies
### Single Machine Deployment
Simple setup for individual developers:
```shell
# Start the Ollama service
ollama serve

# Pull the required models
ollama pull llama3:8b
ollama pull codellama:7b
```

Configure MyDeskBot:
```yaml
models:
  - name: "primary-model"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://localhost:11434"
  - name: "coding-model"
    provider: "ollama"
    model: "codellama:7b"
    baseURL: "http://localhost:11434"
```

### Docker Deployment
Containerized deployment for consistency:
```dockerfile
# Dockerfile
FROM ollama/ollama:latest
COPY models/ /root/.ollama/models/
EXPOSE 11434
CMD ["ollama", "serve"]
```

```shell
# Build and run
docker build -t my-ollama .
docker run -d -p 11434:11434 my-ollama
```

### Kubernetes Deployment
For enterprise-level deployment:
```yaml
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: models
              mountPath: /root/.ollama/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models-pvc
```

## Model Optimization
### Quantization
Reduce model size and improve inference speed:
```shell
# Use quantized model tags (Ollama handles quantization automatically)
ollama pull llama3:8b-q4_0  # 4-bit quantization
ollama pull llama3:8b-q8_0  # 8-bit quantization
```

### Model Pruning
Remove unnecessary parameters:
```python
# Example using Hugging Face Transformers
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# Apply pruning techniques, e.g. with torch.nn.utils.prune
```

### Knowledge Distillation
Create smaller, faster student models:
Train a smaller "student" model to reproduce the outputs of a larger "teacher" model. This requires significant machine learning expertise.

## Security Considerations
### Network Security
Secure your model servers:
```yaml
# Use HTTPS for model endpoints
models:
  - name: "secure-model"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "https://models.company.internal:11434"
    options:
      ssl_verify: true
```

### Authentication
Add authentication to model servers:
```shell
# For text-generation-webui
python server.py --model llama-3-8b --api-auth username:password
```

### Access Control
Restrict model access:
- Configure firewall rules
- Only allow connections from trusted IPs
- Use a VPN for remote access

## Monitoring and Maintenance
### Health Monitoring
Monitor model server health:
```shell
# Check Ollama status
ollama list
curl http://localhost:11434/api/tags

# Monitor system resources
htop
nvidia-smi  # GPU monitoring
```

### Performance Metrics
Track performance metrics:
```yaml
# Enable logging and metrics
models:
  - name: "monitored-model"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://localhost:11434"
    options:
      log_level: "info"
```

### Model Updates
Keep models up to date:
```shell
# Update a model
ollama pull llama3:8b  # Pulls the latest version

# Or set up a maintenance schedule
# Weekly: ollama pull llama3:8b
```

## Backup and Recovery
### Model Backups
Backup important models:
```shell
# Create a local copy of a model
ollama cp llama3:8b backup-llama3:8b

# Save to external storage:
# copy ~/.ollama/models to a backup location
```

### Configuration Backups
Backup configurations:
```shell
# Back up the MyDeskBot config
cp .mydeskbot/config.yaml ~/backups/mydeskbot-config-$(date +%Y%m%d).yaml

# Back up the model store
cp -r ~/.ollama ~/backups/ollama-backup-$(date +%Y%m%d)
```

## Troubleshooting
### Common Issues
#### Model Loading Failures
```shell
# Check available memory
free -h   # Linux
vm_stat   # macOS

# Check disk space
df -h

# Re-pull the model
ollama rm llama3:8b
ollama pull llama3:8b
```

#### Performance Problems
```shell
# Monitor resource usage
htop
iotop  # Disk I/O monitoring
```

Adjust model parameters:
```yaml
models:
  - name: "optimized-model"
    provider: "ollama"
    model: "llama3:8b"
    options:
      num_thread: 6
      num_gpu: 1
```

#### Connection Issues
```shell
# Check whether the service is running
ps aux | grep ollama

# Test the connection
curl http://localhost:11434/api/tags

# Also check firewall settings
```

### Debugging Commands
```shell
# Enable debug logging
OLLAMA_DEBUG=1 ollama serve

# Check logs
journalctl -u ollama -f                # Linux
tail -f /usr/local/var/log/ollama.log  # macOS

# Test the model directly
echo '{"model":"llama3:8b","prompt":"Hello"}' | \
  curl -X POST -H "Content-Type: application/json" -d @- http://localhost:11434/api/generate
```

## Best Practices
### Model Management
- Version Control: Track model versions
- Regular Updates: Update models periodically
- Performance Testing: Test models before deployment
- Resource Planning: Plan for adequate hardware resources
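Version tracking can be as simple as appending each deployed model and its digest to a small manifest. A minimal sketch; the manifest format and the `record_model_version` helper are made up for illustration:

```python
import json
import time
from pathlib import Path

def record_model_version(manifest_path: str, model: str, digest: str) -> list:
    """Append a model/digest entry to a JSON manifest and return all entries."""
    path = Path(manifest_path)
    entries = json.loads(path.read_text()) if path.exists() else []
    entries.append({"model": model, "digest": digest,
                    "date": time.strftime("%Y-%m-%d")})
    path.write_text(json.dumps(entries, indent=2))
    return entries

# The digest below is a placeholder, not a real model hash
record_model_version("models.json", "llama3:8b", "sha256:0000")
```

Checking the manifest into version control gives you an auditable history of which model versions were deployed when.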
### Security
- Network Isolation: Keep model servers isolated
- Access Logging: Log all model access
- Regular Audits: Audit model usage regularly
- Data Encryption: Encrypt data in transit and at rest
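An application-level allow-list can complement firewall rules for access control. A sketch using Python's standard `ipaddress` module; the networks listed are examples, not a recommendation:

```python
import ipaddress

# Example trusted networks - substitute your own ranges
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.1.0/24"),
]

def is_allowed(client_ip: str) -> bool:
    """Return True if the client address falls inside a trusted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)

print(is_allowed("10.1.2.3"))  # → True
print(is_allowed("8.8.8.8"))   # → False
```

A check like this belongs in whatever proxy or middleware sits in front of the model server, as a second layer behind the firewall.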
### Cost Optimization
- Right-Sizing: Choose appropriate model sizes
- Usage Monitoring: Monitor model usage
- Scheduled Scaling: Scale resources based on demand
- Model Sharing: Share models across teams
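The right-sizing advice can be reduced to a simple rule of thumb keyed to available memory. The thresholds below are assumptions loosely based on the model sizes listed earlier, not hard limits:

```python
def pick_model_tier(available_ram_gb: int) -> str:
    """Map available memory to a rough model tier (assumed thresholds)."""
    if available_ram_gb >= 64:
        return "large"   # e.g. llama3:70b
    if available_ram_gb >= 16:
        return "medium"  # e.g. llama3:8b
    return "small"       # e.g. gemma:2b

print(pick_model_tier(32))  # → medium
```

Starting with the smallest tier that handles your tasks, then moving up only when quality falls short, keeps hardware costs down.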
## Enterprise Deployment
### High Availability
Deploy redundant model servers:
```yaml
# Load balancer configuration
models:
  - name: "ha-model-primary"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://model-server-1:11434"
  - name: "ha-model-secondary"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://model-server-2:11434"
```

### Disaster Recovery
Plan for disaster recovery:
Key elements of a recovery plan:
- Regular backups
- Automated failover
- Cross-region replication
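The backup step of such a plan can be automated with a short script. A sketch using only the Python standard library; the directory paths are examples:

```python
import shutil
import time
from pathlib import Path

def backup_model_store(models_dir: str, backup_root: str) -> str:
    """Archive a local model directory into a timestamped tarball.

    Returns the path of the created archive.
    """
    Path(backup_root).mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d")
    target = Path(backup_root) / f"ollama-models-{stamp}"
    # shutil.make_archive appends the .tar.gz suffix itself
    return shutil.make_archive(str(target), "gztar", root_dir=models_dir)

# e.g. backup_model_store("~/.ollama/models", "~/backups")
```

Run from cron or a systemd timer, and copy the resulting archive off-machine to cover the cross-region replication point.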