#############################
8.9 Run LLM Locally in Docker
#############################

=====================
Why Run LLMs Locally?
=====================

Running Large Language Models (LLMs) locally offers several advantages for developers and organizations:

**Privacy & Security**

- Your data never leaves your infrastructure
- No risk of sensitive information being sent to third-party APIs
- Complete control over data handling and storage

**Cost Efficiency**

- No per-token charges or API fees
- Predictable resource costs
- Better for high-volume applications

**Performance & Reliability**

- No network latency or internet dependency
- Guaranteed availability
- Consistent response times

**Development & Testing**

- Experiment with different models freely
- Debug without external API limitations
- Create reproducible testing environments

**Compliance**

- Meet strict data governance requirements
- Maintain regulatory compliance (GDPR, HIPAA, etc.)
- Keep intellectual property secure

================
Method 1: Ollama
================

**What is Ollama?**

Ollama is a lightweight, extensible framework for running large language models locally. It provides:

- A simple command-line interface for model management
- An optimized inference engine with quantization support
- A RESTful API for integration with applications
- Support for popular models (Llama, Mistral, CodeLlama, etc.)
- Automatic GPU acceleration when available

**Setting Up Ollama**

.. code-block:: bash

   # Run the Ollama container with GPU support (if available)
   docker run -d --name ollama \
     --gpus all \
     -v ollama:/root/.ollama \
     -p 11434:11434 \
     ollama/ollama:latest

   # For systems without a GPU, omit the --gpus flag:
   # docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama:latest

   # Wait for the API to be ready
   curl -sf http://localhost:11434/api/version

   # Access the container to manage models
   docker exec -it ollama bash

   # Pull and run a lightweight model (good for testing)
   ollama pull qwen2:1.5b
   ollama run qwen2:1.5b

**Finding and Choosing Models**

Ollama supports many popular models. You can:

1. **Browse available models**: Visit https://ollama.com/library

2. **Popular models by size**:

   - **Small (1-3B parameters)**: Good for basic tasks, fast inference

     - ``qwen2:1.5b`` - General purpose, multilingual
     - ``phi3:mini`` - Microsoft's efficient model
     - ``gemma2:2b`` - Google's lightweight model

   - **Medium (7-8B parameters)**: Balanced performance and speed

     - ``llama3:8b`` - Meta's general-purpose model
     - ``mistral:7b`` - Excellent for code and reasoning
     - ``codellama:7b`` - Specialized for code generation

   - **Large (13B+ parameters)**: Best quality, requires more resources

     - ``llama3:70b`` - High-quality responses (requires significant RAM)
     - ``mixtral:8x7b`` - Mixture-of-experts architecture

3. **Check model requirements**:

   .. code-block:: bash

      # List installed models
      ollama list

      # Get model information
      ollama show qwen2:1.5b

**Managing Models**

.. code-block:: bash

   # Pull a specific model
   ollama pull llama3:8b

   # Remove a model to save space
   ollama rm qwen2:1.5b

   # Update a model (re-pulling fetches the latest version)
   ollama pull llama3:8b
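**Listing Models via the API (Python)**

The same housekeeping can also be done without entering the container. As a minimal sketch (assuming the container started above is still publishing port 11434 on localhost), the snippet below queries Ollama's ``/api/tags`` endpoint, which reports the locally installed models much like ``ollama list``:

.. code-block:: python

   import requests

   # Query Ollama's model listing endpoint (the API equivalent of `ollama list`).
   # Assumes the container from the setup step is reachable on localhost:11434.
   response = requests.get("http://localhost:11434/api/tags", timeout=10)
   response.raise_for_status()

   for model in response.json().get("models", []):
       # Sizes are reported in bytes; converting to GB here is only for readability.
       size_gb = model.get("size", 0) / 1e9
       print(f"{model['name']}: {size_gb:.1f} GB")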
=========================
Testing Ollama Deployment
=========================

**Interactive Testing**

.. code-block:: bash

   # Start an interactive session
   ollama run qwen2:1.5b

   # Type your questions and press Enter
   # Use /bye to exit

**Command Line Testing**

.. code-block:: bash

   # Single prompt execution
   ollama run qwen2:1.5b "Write a 2-line poem about containers."

**API Testing**

.. code-block:: bash

   # Simple API call
   curl http://localhost:11434/api/generate -d '{
     "model": "qwen2:1.5b",
     "prompt": "Explain what a tensor is in one sentence.",
     "stream": false
   }'

.. code-block:: bash

   # Streaming response (real-time output)
   curl -N http://localhost:11434/api/generate -d '{
     "model": "qwen2:1.5b",
     "prompt": "Explain DevOps in simple terms.",
     "stream": true
   }'

**Integration Example (Python)**

.. code-block:: python

   import requests

   def ask_ollama(prompt, model="qwen2:1.5b"):
       response = requests.post(
           "http://localhost:11434/api/generate",
           json={"model": model, "prompt": prompt, "stream": False},
       )
       return response.json()["response"]

   # Usage
   answer = ask_ollama("What is Docker?")
   print(answer)

=======================================================
Method 2: Hugging Face Text Generation Inference (TGI)
=======================================================

**What is TGI?**

Text Generation Inference (TGI) is Hugging Face's production-ready inference server for large language models. It offers:

- **Optimized Performance**: Highly optimized inference engine with batching and caching
- **Production Ready**: Built for high-throughput, low-latency serving
- **Quantization Support**: Built-in support for 4-bit and 8-bit quantization
- **Streaming**: Real-time token streaming for a better user experience
- **Enterprise Features**: Metrics, health checks, and monitoring endpoints

**One-liner Deployment with TGI**

.. code-block:: bash

   # Set the model and cache directory
   MODEL=Qwen/Qwen2-1.5B-Instruct
   VOL=$HOME/hf-cache

   docker run -d --name tgi --gpus all --shm-size 1g \
     -p 8080:80 -v "$VOL":/data \
     ghcr.io/huggingface/text-generation-inference:latest \
     --model-id $MODEL \
     --quantize eetq \
     --max-input-tokens 2048 --max-total-tokens 4096

**For Different Hardware Configurations**

.. code-block:: bash

   # For a 4GB GPU (GTX 1050 Ti, RTX 3050)
   # Uses 4-bit quantization for extreme VRAM savings
   MODEL=Qwen/Qwen2-1.5B-Instruct

   docker run -d --name tgi --gpus all --shm-size 1g \
     -p 8080:80 -v "$HOME/hf-cache":/data \
     ghcr.io/huggingface/text-generation-inference:latest \
     --model-id $MODEL \
     --quantize bitsandbytes-nf4 \
     --max-input-tokens 1024 --max-total-tokens 2048

.. code-block:: bash

   # For CPU-only deployment (no GPU)
   MODEL=Qwen/Qwen2-1.5B-Instruct

   docker run -d --name tgi --shm-size 1g \
     -p 8080:80 -v "$HOME/hf-cache":/data \
     ghcr.io/huggingface/text-generation-inference:latest \
     --model-id $MODEL \
     --max-input-tokens 1024 --max-total-tokens 2048

**Using Private/Gated Models**

.. code-block:: bash

   # For models requiring authentication
   # (get a token from https://huggingface.co/settings/tokens)
   export HF_TOKEN=your_token_here
   MODEL=meta-llama/Llama-2-7b-chat-hf  # Example gated model

   docker run -d --name tgi --gpus all --shm-size 1g \
     -p 8080:80 -v "$HOME/hf-cache":/data \
     -e HF_TOKEN=$HF_TOKEN \
     ghcr.io/huggingface/text-generation-inference:latest \
     --model-id $MODEL \
     --quantize eetq

**Testing TGI Deployment**

.. code-block:: bash

   # Wait for the model to load (check the logs)
   docker logs -f tgi

   # Health check
   curl http://localhost:8080/health

   # Get model info
   curl http://localhost:8080/info
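**Waiting for the Model to Load (Python)**

Downloading and loading the model can take several minutes on first start, during which the API is not yet answering. The following is a minimal sketch that polls the ``/health`` endpoint shown above before sending any requests; the 5-second interval and 10-minute timeout are arbitrary choices, not TGI defaults:

.. code-block:: python

   import time
   import requests

   def wait_for_tgi(base_url="http://localhost:8080", timeout_s=600):
       """Poll TGI's /health endpoint until the model is loaded or the timeout expires."""
       deadline = time.time() + timeout_s
       while time.time() < deadline:
           try:
               if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                   return True
           except requests.ConnectionError:
               pass  # Server not accepting connections yet (model still loading)
           time.sleep(5)
       return False

   if wait_for_tgi():
       print("TGI is ready to serve requests")
   else:
       print("Timed out waiting for TGI; check `docker logs tgi`")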
**API Usage Examples**

.. code-block:: bash

   # Simple completion (non-streaming)
   curl -s http://localhost:8080/generate \
     -H 'Content-Type: application/json' \
     -d '{
       "inputs": "What is DevOps?",
       "parameters": {
         "max_new_tokens": 128,
         "temperature": 0.7
       },
       "stream": false
     }'

.. code-block:: bash

   # Streaming response (real-time tokens)
   curl -N http://localhost:8080/generate_stream \
     -H 'Content-Type: application/json' \
     -d '{
       "inputs": "Explain containers in simple terms:",
       "parameters": {
         "max_new_tokens": 200,
         "temperature": 0.8,
         "top_p": 0.9
       }
     }'

.. code-block:: bash

   # Chat completion format
   curl -s http://localhost:8080/generate \
     -H 'Content-Type: application/json' \
     -d '{
       "inputs": "<|im_start|>system\nYou are a helpful DevOps assistant.<|im_end|>\n<|im_start|>user\nWhat is Docker?<|im_end|>\n<|im_start|>assistant\n",
       "parameters": {
         "max_new_tokens": 150,
         "temperature": 0.7,
         "stop": ["<|im_end|>"]
       },
       "stream": false
     }'

**Method Comparison**

+----------------------+-------------------+-------------------+------------------------------+
| Aspect               | Ollama            | TGI               | Recommendation               |
+======================+===================+===================+==============================+
| **Ease of Use**      | Excellent         | Good              | Ollama for beginners         |
|                      | (CLI interface)   | (One command)     |                              |
+----------------------+-------------------+-------------------+------------------------------+
| **Performance**      | Very Good         | Excellent         | TGI for production           |
|                      | (Optimized)       | (Production)      |                              |
+----------------------+-------------------+-------------------+------------------------------+
| **Model Variety**    | Good              | Extensive         | TGI for variety              |
|                      | (Curated)         | (Any HF model)    |                              |
+----------------------+-------------------+-------------------+------------------------------+
| **Quantization**     | Built-in          | Advanced          | TGI for optimization         |
|                      | (GGML/GGUF)       | (Multiple types)  |                              |
+----------------------+-------------------+-------------------+------------------------------+
| **Production**       | Good              | Excellent         | TGI for enterprise           |
| **Features**         | (Basic API)       | (Full featured)   |                              |
+----------------------+-------------------+-------------------+------------------------------+
| **Resource Usage**   | Lower             | Medium            | Ollama for limited           |
|                      | (Efficient)       | (Optimized)       | hardware                     |
+----------------------+-------------------+-------------------+------------------------------+
| **Streaming**        | Yes               | Yes               | Both support real-time apps  |
|                      | (Built-in)        | (Built-in)        |                              |
+----------------------+-------------------+-------------------+------------------------------+

**When to Use Each Method:**

- **Ollama**: Perfect for learning, experimentation, simple applications, and resource-constrained environments
- **TGI**: Best for production deployments, high performance requirements, enterprise use, and access to the full Hugging Face model ecosystem
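**Using Either Backend from Python**

To make the comparison concrete, here is a minimal sketch of a single client function that talks to either backend, reusing the ``/api/generate`` and ``/generate`` calls demonstrated earlier. The ``generate()`` helper, the default URLs, and the hard-coded ``qwen2:1.5b`` model name are illustrative assumptions, not part of either project's tooling:

.. code-block:: python

   import requests

   OLLAMA_URL = "http://localhost:11434"  # Ollama container from Method 1
   TGI_URL = "http://localhost:8080"      # TGI container from Method 2

   def generate(prompt, backend="ollama", max_new_tokens=128):
       """Send a prompt to a locally running Ollama or TGI container and return the text."""
       if backend == "ollama":
           # Ollama's non-streaming generate endpoint (see "API Testing" above)
           resp = requests.post(
               f"{OLLAMA_URL}/api/generate",
               json={"model": "qwen2:1.5b", "prompt": prompt, "stream": False},
           )
           return resp.json()["response"]
       # TGI's generate endpoint (see "API Usage Examples" above)
       resp = requests.post(
           f"{TGI_URL}/generate",
           json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
       )
       return resp.json()["generated_text"]

   # Usage: the same call works against either backend
   print(generate("What is Docker?", backend="ollama"))
   print(generate("What is Docker?", backend="tgi"))

Whichever backend you standardize on, keeping the HTTP calls behind one small wrapper like this makes it easy to switch later.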