8.9 Run LLM Locally in Docker

Why Run LLMs Locally?

Running Large Language Models (LLMs) locally offers several advantages for developers and organizations:

Privacy & Security

  • Your data never leaves your infrastructure

  • No risk of sensitive information being sent to third-party APIs

  • Complete control over data handling and storage

Cost Efficiency

  • No per-token charges or API fees

  • Predictable resource costs

  • Better for high-volume applications

Performance & Reliability

  • No network latency or internet dependency

  • Availability does not depend on an external provider

  • Consistent response times

Development & Testing

  • Experiment with different models freely

  • Debug without external API limitations

  • Create reproducible testing environments

Compliance

  • Meet strict data governance requirements

  • Maintain regulatory compliance (GDPR, HIPAA, etc.)

  • Keep intellectual property secure

Method 1: Ollama

What is Ollama?

Ollama is a lightweight, extensible framework for running large language models locally. It provides:

  • Simple command-line interface for model management

  • Optimized inference engine with quantization support

  • RESTful API for integration with applications

  • Support for popular models (Llama, Mistral, CodeLlama, etc.)

  • Automatic GPU acceleration when available

Setting Up Ollama

# Run Ollama container with GPU support (if available)
docker run -d --name ollama \
    --gpus all \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    ollama/ollama:latest

# For systems without GPU, omit the --gpus flag
# docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama:latest

# Wait for the API to be ready
curl -sf http://localhost:11434/api/version

# Access the container to manage models
docker exec -it ollama bash

# Pull and run a lightweight model (good for testing)
ollama pull qwen2:1.5b
ollama run qwen2:1.5b
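
Once the container is up, model management does not have to go through docker exec: the same operations are exposed by the REST API on port 11434. Below is a minimal Python sketch (endpoint and field names assume the current Ollama REST API, i.e. /api/version and /api/pull) that waits for the server and then pulls a model programmatically:

import time
import requests

OLLAMA = "http://localhost:11434"

def wait_for_ollama(timeout=60):
    # Poll /api/version until the server answers
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{OLLAMA}/api/version", timeout=2).ok:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(2)
    raise TimeoutError("Ollama API did not become ready in time")

def pull_model(name="qwen2:1.5b"):
    # Equivalent to `ollama pull`; stream=False returns a single status object when the pull finishes
    r = requests.post(f"{OLLAMA}/api/pull", json={"model": name, "stream": False}, timeout=None)
    r.raise_for_status()
    return r.json()

wait_for_ollama()
print(pull_model())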

Finding and Choosing Models

Ollama supports many popular models. You can:

  1. Browse available models: Visit https://ollama.com/library

  2. Popular models by size:

    • Small (1-3B parameters): Good for basic tasks, fast inference

      • qwen2:1.5b - General purpose, multilingual

      • phi3:mini - Microsoft’s efficient model

      • gemma2:2b - Google’s lightweight model

    • Medium (7-8B parameters): Balanced performance and speed

      • llama3:8b - Meta’s general purpose model

      • mistral:7b - Excellent for code and reasoning

      • codellama:7b - Specialized for code generation

    • Large (13B+ parameters): Best quality, requires more resources

      • llama3:70b - High-quality responses (requires significant RAM)

      • mixtral:8x7b - Mixture of experts architecture

  3. Check model requirements:

# List installed models
ollama list

# Get model information
ollama show qwen2:1.5b

Managing Models

# Pull a specific model
ollama pull llama3:8b

# Remove a model to save space
ollama rm qwen2:1.5b

# Update a model
ollama pull llama3:8b  # Re-pulling updates the model
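
These commands can also be scripted against the API, which is handy for automation. A short sketch, assuming the /api/tags and /api/delete endpoints behave as described in the Ollama API documentation:

import requests

OLLAMA = "http://localhost:11434"

# List installed models (equivalent to `ollama list`)
for m in requests.get(f"{OLLAMA}/api/tags", timeout=10).json().get("models", []):
    print(m["name"], m.get("size"))

# Remove a model (equivalent to `ollama rm`)
requests.delete(f"{OLLAMA}/api/delete", json={"model": "qwen2:1.5b"}, timeout=30)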

Testing Ollama Deployment

Interactive Testing

# Start an interactive session
ollama run qwen2:1.5b
# Type your questions and press Enter
# Use /bye to exit

Command Line Testing

# Single prompt execution
ollama run qwen2:1.5b "Write a 2-line poem about containers."

API Testing

# Simple API call
curl http://localhost:11434/api/generate -d '{
    "model": "qwen2:1.5b",
    "prompt": "Explain what a tensor is in one sentence.",
    "stream": false
}'

# Streaming response (real-time output)
curl -N http://localhost:11434/api/generate -d '{
    "model": "qwen2:1.5b",
    "prompt": "Explain DevOps in simple terms.",
    "stream": true
}'

Integration Example (Python)

import requests

# Minimal helper that sends a single, non-streaming prompt to the local Ollama API
def ask_ollama(prompt, model="qwen2:1.5b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # local inference can be slow, especially on CPU
    )
    response.raise_for_status()
    return response.json()["response"]

# Usage
answer = ask_ollama("What is Docker?")
print(answer)
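
For interactive applications you will usually want the streamed variant instead. A minimal sketch, assuming the newline-delimited JSON format shown in the streaming curl example above:

import json
import requests

def stream_ollama(prompt, model="qwen2:1.5b"):
    # Yields response fragments as Ollama generates them
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=300,
    ) as response:
        response.raise_for_status()
        # Ollama streams one JSON object per line; "done": true marks the end
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk.get("response", "")

# Usage: print tokens as they arrive
for token in stream_ollama("Explain DevOps in simple terms."):
    print(token, end="", flush=True)
print()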

Method 2: Hugging Face Text Generation Inference (TGI)

What is TGI?

Text Generation Inference (TGI) is Hugging Face’s production-ready inference server for large language models. It offers:

  • Optimized Performance: Highly optimized inference engine with batching and caching

  • Production Ready: Built for high-throughput, low-latency serving

  • Quantization Support: Built-in support for 4-bit and 8-bit quantization

  • Streaming: Real-time token streaming for better user experience

  • Enterprise Features: Metrics, health checks, and monitoring endpoints

One-liner Deployment with TGI

# Set model and cache directory
MODEL=Qwen/Qwen2-1.5B-Instruct
VOL=$HOME/hf-cache

docker run -d --name tgi --gpus all --shm-size 1g \
  -p 8080:80 -v "$VOL":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --quantize eetq \
  --max-input-tokens 2048 --max-total-tokens 4096

For Different Hardware Configurations

# For 4GB GPU (GTX 1050 Ti, RTX 3050)
# Uses 4-bit quantization for extreme VRAM savings
MODEL=Qwen/Qwen2-1.5B-Instruct
docker run -d --name tgi --gpus all --shm-size 1g \
  -p 8080:80 -v "$HOME/hf-cache":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --quantize bitsandbytes-nf4 \
  --max-input-tokens 1024 --max-total-tokens 2048

# For CPU-only deployment (no GPU)
MODEL=Qwen/Qwen2-1.5B-Instruct
docker run -d --name tgi --shm-size 1g \
  -p 8080:80 -v "$HOME/hf-cache":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --max-input-tokens 1024 --max-total-tokens 2048

Using Private/Gated Models

# For models requiring authentication (get token from https://huggingface.co/settings/tokens)
export HF_TOKEN=your_token_here

MODEL=meta-llama/Llama-2-7b-chat-hf  # Example gated model
docker run -d --name tgi --gpus all --shm-size 1g \
  -p 8080:80 -v "$HOME/hf-cache":/data \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $MODEL \
  --quantize eetq

Testing TGI Deployment

# Wait for the model to load (check logs)
docker logs -f tgi

# Health check
curl http://localhost:8080/health

# Get model info
curl http://localhost:8080/info
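
On first start the model weights are downloaded into the cache volume, so loading can take several minutes. A small Python sketch that polls the /health endpoint shown above and then prints the model info:

import time
import requests

TGI = "http://localhost:8080"

def wait_for_tgi(timeout=600):
    # /health returns 200 once the model is loaded and ready to serve
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{TGI}/health", timeout=5).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(5)
    raise TimeoutError("TGI did not become healthy in time")

wait_for_tgi()
print(requests.get(f"{TGI}/info", timeout=10).json())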

API Usage Examples

# Simple completion (non-streaming)
curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is DevOps?",
    "parameters": {
      "max_new_tokens": 128,
      "temperature": 0.7
    },
    "stream": false
  }'

# Streaming response (real-time tokens)
curl -N http://localhost:8080/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Explain containers in simple terms:",
    "parameters": {
      "max_new_tokens": 200,
      "temperature": 0.8,
      "top_p": 0.9
    }
  }'

# Chat completion format
curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "<|im_start|>system\nYou are a helpful DevOps assistant.<|im_end|>\n<|im_start|>user\nWhat is Docker?<|im_end|>\n<|im_start|>assistant\n",
    "parameters": {
      "max_new_tokens": 150,
      "temperature": 0.7,
      "stop": ["<|im_end|>"]
    },
    "stream": false
  }'
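
As with Ollama, TGI is easy to call from application code. A minimal Python helper, assuming the documented /generate response schema with a generated_text field:

import requests

def ask_tgi(prompt, max_new_tokens=128, temperature=0.7):
    # Non-streaming generation request against the local TGI server
    response = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_new_tokens,
                "temperature": temperature,
            },
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["generated_text"]

# Usage
print(ask_tgi("What is DevOps?"))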

Method Comparison

| Aspect              | Ollama                    | TGI                          | Recommendation              |
|---------------------|---------------------------|------------------------------|-----------------------------|
| Ease of Use         | Excellent (CLI interface) | Good (one command)           | Ollama for beginners        |
| Performance         | Very good (optimized)     | Excellent (production-grade) | TGI for production          |
| Model Variety       | Good (curated library)    | Extensive (any HF model)     | TGI for variety             |
| Quantization        | Built-in (GGML/GGUF)      | Advanced (multiple types)    | TGI for optimization        |
| Production Features | Good (basic API)          | Excellent (full featured)    | TGI for enterprise          |
| Resource Usage      | Lower (efficient)         | Medium (optimized)           | Ollama for limited hardware |
| Streaming           | Yes (built-in)            | Yes (built-in)               | Both support real-time apps |

When to Use Each Method:

  • Ollama: Perfect for learning, experimentation, simple applications, and resource-constrained environments

  • TGI: Best for production deployments, high performance requirements, enterprise use, and access to the full Hugging Face model ecosystem