8.9 Run LLM Locally in Docker
Why Run LLMs Locally?
Running Large Language Models (LLMs) locally offers several advantages for developers and organizations:
Privacy & Security
Your data never leaves your infrastructure
No risk of sensitive information being sent to third-party APIs
Complete control over data handling and storage
Cost Efficiency
No per-token charges or API fees
Predictable resource costs
Better for high-volume applications
Performance & Reliability
No network latency or internet dependency
Guaranteed availability
Consistent response times
Development & Testing
Experiment with different models freely
Debug without external API limitations
Create reproducible testing environments
Compliance
Meet strict data governance requirements
Maintain regulatory compliance (GDPR, HIPAA, etc.)
Keep intellectual property secure
Method 1: Ollama
What is Ollama?
Ollama is a lightweight, extensible framework for running large language models locally. It provides:
Simple command-line interface for model management
Optimized inference engine with quantization support
RESTful API for integration with applications
Support for popular models (Llama, Mistral, CodeLlama, etc.)
Automatic GPU acceleration when available
Setting Up Ollama
# Run Ollama container with GPU support (if available)
docker run -d --name ollama \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:latest
# For systems without GPU, omit the --gpus flag
# docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama:latest
# Wait for the API to be ready
curl -sf http://localhost:11434/api/version
# Access the container to manage models
docker exec -it ollama bash
# Pull and run a lightweight model (good for testing)
ollama pull qwen2:1.5b
ollama run qwen2:1.5b
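Note that the curl check above only succeeds once the server inside the container has finished starting, so a script that runs it immediately after docker run may need to retry. A minimal retry-loop sketch in Python (the 60-second timeout is an arbitrary choice):

import time
import requests

def wait_for_ollama(base_url="http://localhost:11434", timeout=60):
    """Poll the version endpoint until the Ollama API answers or we give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/api/version", timeout=2)
            if r.ok:
                return r.json()  # e.g. {"version": "..."}
        except requests.ConnectionError:
            pass  # server inside the container is still starting
        time.sleep(2)
    raise TimeoutError("Ollama API did not become ready in time")

print(wait_for_ollama())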
Finding and Choosing Models
Ollama supports many popular models. You can:
Browse available models: Visit https://ollama.com/library
Popular models by size:
Small (1-3B parameters): Good for basic tasks, fast inference
qwen2:1.5b - General purpose, multilingual
phi3:mini - Microsoft's efficient model
gemma2:2b - Google's lightweight model
Medium (7-8B parameters): Balanced performance and speed
llama3:8b - Meta's general-purpose model
mistral:7b - Excellent for code and reasoning
codellama:7b - Specialized for code generation
Large (13B+ parameters): Best quality, requires more resources
llama3:70b - High-quality responses (requires significant RAM)
mixtral:8x7b - Mixture-of-experts architecture
Check model requirements:
# List installed models
ollama list
# Get model information
ollama show qwen2:1.5b
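To relate the size tiers above to hardware, a 4-bit quantized model needs very roughly half a byte per parameter for its weights, plus headroom for the KV cache and runtime. The sketch below turns that rule of thumb into ballpark figures; treat them as estimates, not guarantees:

def approx_memory_gb(params_billion, bytes_per_param=0.6):
    # ~0.5 bytes per parameter for 4-bit weights, rounded up for KV cache
    # and runtime overhead -- a ballpark estimate only.
    return params_billion * bytes_per_param

for size in (1.5, 8, 70):
    print(f"{size}B parameters -> roughly {approx_memory_gb(size):.0f} GB of RAM/VRAM")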
Managing Models
# Pull a specific model
ollama pull llama3:8b
# Remove a model to save space
ollama rm qwen2:1.5b
# Update a model
ollama pull llama3:8b # Re-pulling updates the model
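Models can also be managed over the REST API instead of exec-ing into the container, using Ollama's documented /api/tags (list) and /api/pull (download) endpoints. A minimal sketch (the model name is just an example):

import requests

BASE = "http://localhost:11434"

# List models already stored in the ollama volume
for model in requests.get(f"{BASE}/api/tags").json().get("models", []):
    print(model["name"])

# Pull a model without an interactive shell; stream=False blocks until
# the download finishes, which can take a while for larger models.
resp = requests.post(f"{BASE}/api/pull",
                     json={"name": "llama3:8b", "stream": False},
                     timeout=None)
print(resp.json().get("status"))  # "success" once the pull completes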
Testing Ollama Deployment
Interactive Testing
# Start an interactive session
ollama run qwen2:1.5b
# Type your questions and press Enter
# Use /bye to exit
Command Line Testing
# Single prompt execution
ollama run qwen2:1.5b "Write a 2-line poem about containers."
API Testing
# Simple API call
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2:1.5b",
  "prompt": "Explain what a tensor is in one sentence.",
  "stream": false
}'
# Streaming response (real-time output)
curl -N http://localhost:11434/api/generate -d '{
  "model": "qwen2:1.5b",
  "prompt": "Explain DevOps in simple terms.",
  "stream": true
}'
Integration Example (Python)
import requests

def ask_ollama(prompt, model="qwen2:1.5b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        },
    )
    return response.json()["response"]

# Usage
answer = ask_ollama("What is Docker?")
print(answer)
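The streaming variant of the same endpoint returns one JSON object per line, each carrying a partial response. A minimal sketch of consuming that stream from Python:

import json
import requests

def stream_ollama(prompt, model="qwen2:1.5b"):
    # With "stream": true the endpoint sends newline-delimited JSON chunks,
    # each with a partial "response"; the final chunk has "done": true.
    with requests.post("http://localhost:11434/api/generate",
                       json={"model": model, "prompt": prompt, "stream": True},
                       stream=True) as r:
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                print()
                break

stream_ollama("Explain DevOps in simple terms.")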
Method 2: Hugging Face Text Generation Inference (TGI)
What is TGI?
Text Generation Inference (TGI) is Hugging Face’s production-ready inference server for large language models. It offers:
Optimized Performance: Highly optimized inference engine with batching and caching
Production Ready: Built for high-throughput, low-latency serving
Quantization Support: Built-in support for 4-bit and 8-bit quantization
Streaming: Real-time token streaming for better user experience
Enterprise Features: Metrics, health checks, and monitoring endpoints
One-liner Deployment with TGI
# Set model and cache directory
MODEL=Qwen/Qwen2-1.5B-Instruct
VOL=$HOME/hf-cache
docker run -d --name tgi --gpus all --shm-size 1g \
  -p 8080:80 -v "$VOL":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --quantize eetq \
  --max-input-tokens 2048 --max-total-tokens 4096
For Different Hardware Configurations
# For 4GB GPU (GTX 1050 Ti, RTX 3050)
# Uses 4-bit quantization for extreme VRAM savings
MODEL=Qwen/Qwen2-1.5B-Instruct
docker run -d --name tgi --gpus all --shm-size 1g \
  -p 8080:80 -v "$HOME/hf-cache":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --quantize bitsandbytes-nf4 \
  --max-input-tokens 1024 --max-total-tokens 2048
# For CPU-only deployment (no GPU)
MODEL=Qwen/Qwen2-1.5B-Instruct
docker run -d --name tgi --shm-size 1g \
  -p 8080:80 -v "$HOME/hf-cache":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --max-input-tokens 1024 --max-total-tokens 2048
Using Private/Gated Models
# For models requiring authentication (get token from https://huggingface.co/settings/tokens)
export HF_TOKEN=your_token_here
MODEL=meta-llama/Llama-2-7b-chat-hf # Example gated model
docker run -d --name tgi --gpus all --shm-size 1g \
  -p 8080:80 -v "$HOME/hf-cache":/data \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --quantize eetq
Testing TGI Deployment
# Wait for the model to load (check logs)
docker logs -f tgi
# Health check
curl http://localhost:8080/health
# Get model info
curl http://localhost:8080/info
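The /info response can also be read programmatically to confirm which model and token limits the server was actually started with. A small sketch; exact field names can vary slightly between TGI versions:

import requests

BASE = "http://localhost:8080"

# Raises (connection error or non-2xx status) while the model is still loading
requests.get(f"{BASE}/health", timeout=5).raise_for_status()

info = requests.get(f"{BASE}/info", timeout=5).json()
print(info.get("model_id"))
print(info.get("max_input_tokens") or info.get("max_input_length"))
print(info.get("max_total_tokens"))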
API Usage Examples
# Simple completion (non-streaming)
curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is DevOps?",
    "parameters": {
      "max_new_tokens": 128,
      "temperature": 0.7
    }
  }'
# Streaming response (real-time tokens)
curl -N http://localhost:8080/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Explain containers in simple terms:",
    "parameters": {
      "max_new_tokens": 200,
      "temperature": 0.8,
      "top_p": 0.9
    }
  }'
# Chat-style prompt (ChatML template, as used by Qwen2-Instruct models)
curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "<|im_start|>system\nYou are a helpful DevOps assistant.<|im_end|>\n<|im_start|>user\nWhat is Docker?<|im_end|>\n<|im_start|>assistant\n",
    "parameters": {
      "max_new_tokens": 150,
      "temperature": 0.7,
      "stop": ["<|im_end|>"]
    }
  }'
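For application code, the same requests-based pattern shown for Ollama works against TGI; only the payload shape changes (inputs plus a parameters object), and the completion is returned in the generated_text field. A minimal non-streaming sketch, assuming the port mapping above:

import requests

def ask_tgi(prompt, max_new_tokens=128, temperature=0.7):
    response = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_new_tokens,
                "temperature": temperature,
            },
        },
    )
    response.raise_for_status()
    return response.json()["generated_text"]

print(ask_tgi("What is DevOps?"))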
Method Comparison
| Aspect | Ollama | TGI | Recommendation |
|---|---|---|---|
| Ease of Use | Excellent (CLI interface) | Good (one command) | Ollama for beginners |
| Performance | Very good (optimized) | Excellent (production-grade) | TGI for production |
| Model Variety | Good (curated library) | Extensive (any HF model) | TGI for variety |
| Quantization | Built-in (GGML/GGUF) | Advanced (multiple types) | TGI for optimization |
| Production Features | Good (basic API) | Excellent (full-featured) | TGI for enterprise |
| Resource Usage | Lower (efficient) | Medium (optimized) | Ollama for limited hardware |
| Streaming | Yes (built-in) | Yes (built-in) | Both support real-time apps |
When to Use Each Method:
Ollama: Perfect for learning, experimentation, simple applications, and resource-constrained environments
TGI: Best for production deployments, high performance requirements, enterprise use, and access to the full Hugging Face model ecosystem