8.9 Run LLM Locally in Docker

Why Run LLMs Locally?

Running Large Language Models (LLMs) locally offers several advantages for developers and organizations:

Privacy & Security

  • Your data never leaves your infrastructure

  • No risk of sensitive information being sent to third-party APIs

  • Complete control over data handling and storage

Cost Efficiency

  • No per-token charges or API fees

  • Predictable resource costs

  • Better for high-volume applications

Performance & Reliability

  • No network latency or internet dependency

  • Availability does not depend on an external provider

  • Consistent response times

Development & Testing

  • Experiment with different models freely

  • Debug without external API limitations

  • Create reproducible testing environments

Compliance

  • Meet strict data governance requirements

  • Maintain regulatory compliance (GDPR, HIPAA, etc.)

  • Keep intellectual property secure

Method 1: Ollama

What is Ollama?

Ollama is a lightweight, extensible framework for running large language models locally. It provides:

  • Simple command-line interface for model management

  • Optimized inference engine with quantization support

  • RESTful API for integration with applications

  • Support for popular models (Llama, Mistral, CodeLlama, etc.)

  • Automatic GPU acceleration when available

Setting Up Ollama

# Run Ollama container with GPU support (if available)
docker run -d --name ollama \
    --gpus all \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    ollama/ollama:latest

# For systems without GPU, omit the --gpus flag
# docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama:latest

# Wait for the API to be ready
curl -sf http://localhost:11434/api/version

# Access the container to manage models
docker exec -it ollama bash

# Pull and run a lightweight model (good for testing)
ollama pull qwen2:1.5b
ollama run qwen2:1.5b
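
Once the container is up, model management does not have to go through docker exec: the same operations are exposed by the REST API on port 11434. Below is a minimal Python sketch (endpoint and field names assume the current Ollama REST API, i.e. /api/version and /api/pull) that waits for the server and then pulls a model programmatically:

import time
import requests

OLLAMA = "http://localhost:11434"

def wait_for_ollama(timeout=60):
    # Poll /api/version until the server answers
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{OLLAMA}/api/version", timeout=2).ok:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(2)
    raise TimeoutError("Ollama API did not become ready in time")

def pull_model(name="qwen2:1.5b"):
    # Equivalent to `ollama pull`; stream=False returns a single status object when the pull finishes
    r = requests.post(f"{OLLAMA}/api/pull", json={"model": name, "stream": False}, timeout=None)
    r.raise_for_status()
    return r.json()

wait_for_ollama()
print(pull_model())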

Finding and Choosing Models

Ollama supports many popular models. You can:

  1. Browse available models: Visit https://ollama.com/library

  2. Popular models by size:

    • Small (1-3B parameters): Good for basic tasks, fast inference

      • qwen2:1.5b - General purpose, multilingual

      • phi3:mini - Microsoft’s efficient model

      • gemma2:2b - Google’s lightweight model

    • Medium (7-8B parameters): Balanced performance and speed

      • llama3:8b - Meta’s general purpose model

      • mistral:7b - Excellent for code and reasoning

      • codellama:7b - Specialized for code generation

    • Large (13B+ parameters): Best quality, requires more resources

      • llama3:70b - High-quality responses (requires significant RAM)

      • mixtral:8x7b - Mixture of experts architecture

  3. Check model requirements:

# List installed models
ollama list

# Get model information
ollama show qwen2:1.5b

Managing Models

# Pull a specific model
ollama pull llama3:8b

# Remove a model to save space
ollama rm qwen2:1.5b

# Update a model
ollama pull llama3:8b  # Re-pulling updates the model
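
These commands can also be scripted against the API, which is handy for automation. A short sketch, assuming the /api/tags and /api/delete endpoints behave as described in the Ollama API documentation:

import requests

OLLAMA = "http://localhost:11434"

# List installed models (equivalent to `ollama list`)
for m in requests.get(f"{OLLAMA}/api/tags", timeout=10).json().get("models", []):
    print(m["name"], m.get("size"))

# Remove a model (equivalent to `ollama rm`)
requests.delete(f"{OLLAMA}/api/delete", json={"model": "qwen2:1.5b"}, timeout=30)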

Testing Ollama Deployment

Interactive Testing

# Start an interactive session
ollama run qwen2:1.5b
# Type your questions and press Enter
# Use /bye to exit

Command Line Testing

# Single prompt execution
ollama run qwen2:1.5b "Write a 2-line poem about containers."

API Testing

# Simple API call
curl http://localhost:11434/api/generate -d '{
    "model": "qwen2:1.5b",
    "prompt": "Explain what a tensor is in one sentence.",
    "stream": false
}'

# Streaming response (real-time output)
curl -N http://localhost:11434/api/generate -d '{
    "model": "qwen2:1.5b",
    "prompt": "Explain DevOps in simple terms.",
    "stream": true
}'

Integration Example (Python)

import requests

# Minimal helper that sends a single, non-streaming prompt to the local Ollama API
def ask_ollama(prompt, model="qwen2:1.5b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # local inference can be slow, especially on CPU
    )
    response.raise_for_status()
    return response.json()["response"]

# Usage
answer = ask_ollama("What is Docker?")
print(answer)
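
For interactive applications you will usually want the streamed variant instead. A minimal sketch, assuming the newline-delimited JSON format shown in the streaming curl example above:

import json
import requests

def stream_ollama(prompt, model="qwen2:1.5b"):
    # Yields response fragments as Ollama generates them
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=300,
    ) as response:
        response.raise_for_status()
        # Ollama streams one JSON object per line; "done": true marks the end
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk.get("response", "")

# Usage: print tokens as they arrive
for token in stream_ollama("Explain DevOps in simple terms."):
    print(token, end="", flush=True)
print()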

Method 2: Hugging Face Text Generation Inference (TGI)

What is TGI?

Text Generation Inference (TGI) is Hugging Face’s production-ready inference server for large language models. It offers:

  • Optimized Performance: Highly optimized inference engine with batching and caching

  • Production Ready: Built for high-throughput, low-latency serving

  • Quantization Support: Built-in support for 4-bit and 8-bit quantization

  • Streaming: Real-time token streaming for better user experience

  • Enterprise Features: Metrics, health checks, and monitoring endpoints

One-liner Deployment with TGI

# Set model and cache directory
MODEL=Qwen/Qwen2-1.5B-Instruct
VOL=$HOME/hf-cache

docker run -d --name tgi --gpus all --shm-size 1g \
  -p 8080:80 -v "$VOL":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --quantize eetq \
  --max-input-tokens 2048 --max-total-tokens 4096

For Different Hardware Configurations

# For 4GB GPU (GTX 1050 Ti, RTX 3050)
# Uses 4-bit quantization for extreme VRAM savings
MODEL=Qwen/Qwen2-1.5B-Instruct
docker run -d --name tgi --gpus all --shm-size 1g \
  -p 8080:80 -v "$HOME/hf-cache":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --quantize bitsandbytes-nf4 \
  --max-input-tokens 1024 --max-total-tokens 2048

# For CPU-only deployment (no GPU)
MODEL=Qwen/Qwen2-1.5B-Instruct
docker run -d --name tgi --shm-size 1g \
  -p 8080:80 -v "$HOME/hf-cache":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --max-input-tokens 1024 --max-total-tokens 2048

Using Private/Gated Models

# For models requiring authentication (get token from https://huggingface.co/settings/tokens)
export HF_TOKEN=your_token_here

MODEL=meta-llama/Llama-2-7b-chat-hf  # Example gated model
docker run -d --name tgi --gpus all --shm-size 1g \
  -p 8080:80 -v "$HOME/hf-cache":/data \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --quantize eetq

Testing TGI Deployment

# Wait for the model to load (check logs)
docker logs -f tgi

# Health check
curl http://localhost:8080/health

# Get model info
curl http://localhost:8080/info
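
On first start the model weights are downloaded into the cache volume, so loading can take several minutes. A small Python sketch that polls the /health endpoint shown above and then prints the model info:

import time
import requests

TGI = "http://localhost:8080"

def wait_for_tgi(timeout=600):
    # /health returns 200 once the model is loaded and ready to serve
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{TGI}/health", timeout=5).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(5)
    raise TimeoutError("TGI did not become healthy in time")

wait_for_tgi()
print(requests.get(f"{TGI}/info", timeout=10).json())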

API Usage Examples

# Simple completion (non-streaming)
curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is DevOps?",
    "parameters": {
      "max_new_tokens": 128,
      "temperature": 0.7
    },
    "stream": false
  }'

# Streaming response (real-time tokens)
curl -N http://localhost:8080/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Explain containers in simple terms:",
    "parameters": {
      "max_new_tokens": 200,
      "temperature": 0.8,
      "top_p": 0.9
    }
  }'

# Chat completion format
curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "<|im_start|>system\nYou are a helpful DevOps assistant.<|im_end|>\n<|im_start|>user\nWhat is Docker?<|im_end|>\n<|im_start|>assistant\n",
    "parameters": {
      "max_new_tokens": 150,
      "temperature": 0.7,
      "stop": ["<|im_end|>"]
    },
    "stream": false
  }'
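
As with Ollama, TGI is easy to call from application code. A minimal Python helper, assuming the documented /generate response schema with a generated_text field:

import requests

def ask_tgi(prompt, max_new_tokens=128, temperature=0.7):
    # Non-streaming generation request against the local TGI server
    response = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_new_tokens,
                "temperature": temperature,
            },
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["generated_text"]

# Usage
print(ask_tgi("What is DevOps?"))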

Method Comparison

| Aspect              | Ollama                    | TGI                          | Recommendation              |
|---------------------|---------------------------|------------------------------|-----------------------------|
| Ease of Use         | Excellent (CLI interface) | Good (one command)           | Ollama for beginners        |
| Performance         | Very good (optimized)     | Excellent (production-grade) | TGI for production          |
| Model Variety       | Good (curated library)    | Extensive (any HF model)     | TGI for variety             |
| Quantization        | Built-in (GGML/GGUF)      | Advanced (multiple types)    | TGI for optimization        |
| Production Features | Good (basic API)          | Excellent (full featured)    | TGI for enterprise          |
| Resource Usage      | Lower (efficient)         | Medium (optimized)           | Ollama for limited hardware |
| Streaming           | Yes (built-in)            | Yes (built-in)               | Both support real-time apps |

When to Use Each Method:

  • Ollama: Perfect for learning, experimentation, simple applications, and resource-constrained environments

  • TGI: Best for production deployments, high performance requirements, enterprise use, and access to the full Hugging Face model ecosystem