############################
8.8 Container Best Practices
############################

.. image:: ../diagrams/containers.png
   :alt: A diagram showing container security layers and best practices
   :width: 800 px

**From Working to World-Class**

You can run containers. You can build images. You can orchestrate applications. Now comes the crucial transformation: turning your container knowledge into production-ready expertise that enterprises depend on.

This section distills hard-won lessons from thousands of production deployments, security incidents, and performance optimizations. These aren't theoretical guidelines - they're battle-tested practices that prevent outages, security breaches, and operational headaches.

===================
Learning Objectives
===================

By the end of this section, you will:

• **Implement** container security hardening that passes enterprise audits
• **Optimize** images for size, performance, and reliability
• **Design** production-ready container architectures
• **Monitor** container performance and troubleshoot issues effectively
• **Automate** security scanning and compliance checking
• **Apply** operational best practices for day-2 container management

**Prerequisites:** Solid understanding of containers, Dockerfiles, and orchestration concepts

==========================
Security Best Practices
==========================

**The Container Security Model**

Container security operates on multiple layers, often called "defense in depth":

.. code-block:: text

   ┌─────────────────────────────────────┐
   │          Application Layer          │ ← Code vulnerabilities, secrets
   ├─────────────────────────────────────┤
   │           Container Layer           │ ← Image vulnerabilities, runtime config
   ├─────────────────────────────────────┤
   │            Host OS Layer            │ ← Kernel, system services
   ├─────────────────────────────────────┤
   │        Infrastructure Layer         │ ← Network, storage, compute
   └─────────────────────────────────────┘

**1. Secure Base Images**

**Use Minimal Base Images:**

.. code-block:: dockerfile

   # EXCELLENT: Distroless (no shell, minimal attack surface)
   FROM gcr.io/distroless/python3

   # GOOD: Alpine Linux (minimal, security-focused)
   FROM python:3.11-alpine

   # ACCEPTABLE: Slim images (smaller than full images)
   FROM python:3.11-slim

   # AVOID: Full images (unnecessary packages, larger attack surface).
   # They contain compilers, debuggers, etc.
   FROM python:3.11

**Pin Specific Versions:**

.. code-block:: dockerfile

   # GOOD: Specific version with SHA256 hash
   FROM python:3.11.6-alpine@sha256:a5b78f3e2a63ce3b...

   # ACCEPTABLE: Specific semantic version
   FROM python:3.11.6-alpine

   # BAD: Moving tags
   # This tag could change with the next patch release
   FROM python:3.11-alpine
   # This tag definitely will change
   FROM python:latest

**Scan Images for Vulnerabilities:**

.. code-block:: bash

   # Using Trivy (free, comprehensive)
   trivy image python:3.11-alpine

   # Using Docker Scout (integrated with Docker)
   docker scout cves python:3.11-alpine

   # Using Snyk (commercial, detailed reporting)
   snyk container test python:3.11-alpine

   # Automate in CI/CD:
   # fail builds if high or critical vulnerabilities are found
   trivy image --exit-code 1 --severity HIGH,CRITICAL myapp:latest

**2. Non-Root User Security**

**Why Root is Dangerous:**

.. code-block:: bash

   # If a process escapes the container, or a host path is mounted into it,
   # root inside the container acts as root on the host:
   docker run --rm -it -v /:/host ubuntu:latest chroot /host
   # ^ This gives full host access if running as root

**Secure User Implementation:**

.. code-block:: dockerfile

   # Create dedicated user with specific UID/GID
   RUN groupadd -r appuser -g 1001 && \
       useradd -r -g appuser -u 1001 -s /bin/false appuser

   # Create application directory and set ownership
   WORKDIR /app
   COPY --chown=appuser:appuser . .

   # Install dependencies as root, then switch
   RUN pip install --no-cache-dir -r requirements.txt

   # Switch to non-root user for runtime
   USER appuser

   # Verify (for debugging)
   # Should output: appuser
   RUN whoami

**Advanced Security Hardening:**

.. code-block:: bash

   # Drop all capabilities, add only what's needed
   docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp:latest

   # Read-only root filesystem
   docker run --read-only --tmpfs /tmp myapp:latest

   # No new privileges
   docker run --security-opt=no-new-privileges myapp:latest

   # Use security profiles
   docker run --security-opt seccomp=seccomp-profile.json myapp:latest

**3. Secrets Management**

**Never Do This:**

.. code-block:: dockerfile

   # NEVER: Secrets in images
   ENV API_KEY=sk-1234567890abcdef
   ENV DATABASE_PASSWORD=super_secret_password
   COPY api_keys.txt /app/

**Proper Secrets Handling:**

.. code-block:: yaml

   # Docker Compose with external secrets
   version: '3.8'
   services:
     app:
       image: myapp:latest
       environment:
         - API_KEY_FILE=/run/secrets/api_key
       secrets:
         - api_key

   secrets:
     api_key:
       external: true  # Managed outside compose

.. code-block:: bash

   # Runtime secret injection
   # Note: environment variables are visible via `docker inspect`,
   # so prefer file-based secrets where possible.
   docker run -e API_KEY="$(cat /secure/api_key)" myapp

   # Other options:
   # - Using init containers to fetch secrets
   # - Kubernetes secret mounting
   # - HashiCorp Vault integration

==================
Image Optimization
==================

**Size Optimization Strategies**

**1. Multi-Stage Builds for Minimal Images:**

.. code-block:: dockerfile

   # Build stage
   FROM node:18-alpine AS builder
   WORKDIR /app
   COPY package*.json ./
   # Install all dependencies: the build step usually needs devDependencies
   RUN npm ci
   COPY . .
   # Build, then drop devDependencies before copying node_modules
   RUN npm run build && npm prune --omit=dev

   # Runtime stage - significantly smaller
   FROM node:18-alpine AS runtime
   RUN addgroup -g 1001 -S nodejs && \
       adduser -S -u 1001 -G nodejs nextjs

   WORKDIR /app

   # Copy only necessary files
   COPY --from=builder --chown=nextjs:nodejs /app/dist ./dist
   COPY --from=builder --chown=nextjs:nodejs /app/node_modules ./node_modules
   COPY --from=builder --chown=nextjs:nodejs /app/package.json ./package.json

   USER nextjs
   CMD ["npm", "start"]

**2. Layer Optimization:**

.. code-block:: dockerfile

   # BAD: Creates multiple layers, poor caching
   RUN apt-get update
   RUN apt-get install -y curl
   RUN apt-get install -y wget
   RUN apt-get clean

   # GOOD: Single layer, better caching
   RUN apt-get update && \
       apt-get install -y \
           curl \
           wget \
       && apt-get clean \
       && rm -rf /var/lib/apt/lists/*

**3. Dependency Management:**

.. code-block:: dockerfile

   # Python: Use wheels and no-cache
   RUN pip install --no-cache-dir --find-links wheels -r requirements.txt

   # Node.js: Install production dependencies only and clean the npm cache
   RUN npm ci --omit=dev && npm cache clean --force

   # Go: Use modules and static linking
   RUN go mod download && \
       CGO_ENABLED=0 go build -ldflags="-w -s" -o app

**4. Use .dockerignore:**

.. code-block:: text

   # .dockerignore
   .git
   .gitignore
   README.md
   Dockerfile
   .dockerignore
   node_modules
   .env
   .env.local
   coverage/
   .nyc_output
   target/
   .pytest_cache
   __pycache__

**Image Size Comparison:**

.. code-block:: bash

   # Analyze image layers
   docker history myapp:latest

   # Compare sizes
   docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

   # Dive tool for detailed analysis
   dive myapp:latest

==========================
Performance Best Practices
==========================

**Resource Management**

**Memory and CPU Limits:**

.. code-block:: yaml

   # Docker Compose
   services:
     web:
       image: myapp:latest
       deploy:
         resources:
           limits:
             memory: 512M
             cpus: '1.0'
           reservations:
             memory: 256M
             cpus: '0.5'

.. code-block:: bash

   # Docker run
   docker run -m 512m --cpus="1.0" myapp:latest

**JVM Applications:**

.. code-block:: dockerfile

   # Set heap size relative to container memory.
   # For a 512MB container, leave ~100MB for non-heap memory.
   ENV JAVA_OPTS="-Xmx400m -Xms400m"

**Health Checks for Reliability:**

.. code-block:: dockerfile

   # Comprehensive health check (requires curl to be installed in the image)
   HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
       CMD curl -f http://localhost:8080/health || exit 1

.. code-block:: python

   # Health check endpoint implementation.
   # Assumes a Flask `app` and a Flask-SQLAlchemy `db` defined elsewhere.
   from datetime import datetime, timezone

   import psutil
   import requests
   from sqlalchemy import text

   @app.route('/health')
   def health_check():
       try:
           # Check database connection
           db.session.execute(text('SELECT 1'))

           # Check external dependencies (raises on HTTP error status)
           response = requests.get('http://api.service.com/ping', timeout=5)
           response.raise_for_status()

           # Check resource usage
           memory_usage = psutil.virtual_memory().percent
           if memory_usage > 90:
               return {'status': 'unhealthy', 'reason': 'high_memory'}, 503

           return {'status': 'healthy',
                   'timestamp': datetime.now(timezone.utc).isoformat()}
       except Exception as e:
           return {'status': 'unhealthy', 'error': str(e)}, 503

**Startup and Graceful Shutdown:**

.. code-block:: dockerfile

   # Use exec form for proper signal handling
   CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:app"]

.. code-block:: python

   # Graceful shutdown handling
   import signal
   import sys

   def signal_handler(sig, frame):
       print('Gracefully shutting down...')
       # Close database connections
       # Finish processing current requests
       # Clean up resources
       sys.exit(0)

   signal.signal(signal.SIGINT, signal_handler)
   signal.signal(signal.SIGTERM, signal_handler)

======================
Monitoring and Logging
======================

**Structured Logging**

.. code-block:: python

   import json
   import logging
   import os
   from datetime import datetime, timezone

   class StructuredLogger:
       def __init__(self):
           self.logger = logging.getLogger(__name__)
           handler = logging.StreamHandler()
           handler.setFormatter(self.JSONFormatter())
           self.logger.addHandler(handler)
           self.logger.setLevel(logging.INFO)

       class JSONFormatter(logging.Formatter):
           def format(self, record):
               log_data = {
                   'timestamp': datetime.now(timezone.utc).isoformat(),
                   'level': record.levelname,
                   'service': 'my-app',
                   'message': record.getMessage(),
                   # Docker sets HOSTNAME to the short container ID by default
                   'container_id': os.environ.get('HOSTNAME', 'unknown')
               }

               if hasattr(record, 'request_id'):
                   log_data['request_id'] = record.request_id

               return json.dumps(log_data)

**Container Metrics Collection:**

.. code-block:: yaml

   # Docker Compose with monitoring
   # (pin image versions instead of `latest` in production)
   version: '3.8'
   services:
     app:
       image: myapp:latest
       logging:
         driver: "json-file"
         options:
           max-size: "10m"
           max-file: "3"

     # Prometheus metrics collection
     prometheus:
       image: prom/prometheus:latest
       ports:
         - "9090:9090"
       volumes:
         - ./prometheus.yml:/etc/prometheus/prometheus.yml

     # Grafana for visualization
     grafana:
       image: grafana/grafana:latest
       ports:
         - "3000:3000"
       environment:
         # Example only: inject the real password as a secret
         - GF_SECURITY_ADMIN_PASSWORD=admin

**Application Metrics:**

.. code-block:: python

   # generate_latest() renders the metrics for a /metrics endpoint
   from prometheus_client import Counter, Histogram, generate_latest

   REQUEST_COUNT = Counter('app_requests_total', 'Total requests', ['method', 'endpoint'])
   REQUEST_DURATION = Histogram('app_request_duration_seconds', 'Request duration')

   @REQUEST_DURATION.time()
   def process_request():
       REQUEST_COUNT.labels(method='GET', endpoint='/api/users').inc()
       # Your application logic here

====================
Development Workflow
====================

**Efficient Development Setup**

**Hot Reloading Configuration:**

.. code-block:: yaml

   # docker-compose.dev.yml
   version: '3.8'
   services:
     web:
       build:
         context: .
         target: development
       volumes:
         - .:/app
         - /app/node_modules  # Prevent overwriting
       environment:
         - NODE_ENV=development
         - CHOKIDAR_USEPOLLING=true  # For file watching in containers

**Testing in Containers:**

.. code-block:: dockerfile

   # Multi-stage with test stage
   FROM python:3.11-slim AS base
   WORKDIR /app
   COPY requirements.txt .
   RUN pip install --no-cache-dir -r requirements.txt

   FROM base AS test
   COPY requirements-test.txt .
   RUN pip install --no-cache-dir -r requirements-test.txt
   COPY . .
   CMD ["pytest", "-v"]

   FROM base AS production
   COPY . .
   CMD ["gunicorn", "app:app"]

.. code-block:: bash

   # Run tests in container
   docker build --target test -t myapp:test .
   docker run --rm myapp:test

**CI/CD Integration:**

.. code-block:: yaml

   # .github/workflows/container.yml
   name: Container CI/CD
   on: [push, pull_request]

   jobs:
     security-scan:
       runs-on: ubuntu-latest
       steps:
         - uses: actions/checkout@v4

         - name: Build image
           run: docker build -t myapp:${{ github.sha }} .

         - name: Security scan
           run: |
             docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
               aquasec/trivy:latest image --exit-code 1 --severity HIGH,CRITICAL \
               myapp:${{ github.sha }}

     test:
       runs-on: ubuntu-latest
       steps:
         - uses: actions/checkout@v4

         - name: Run tests
           run: |
             docker build --target test -t myapp:test .
             docker run --rm myapp:test

=====================
Production Deployment
=====================

**Zero-Downtime Deployments**

**Rolling Updates with Health Checks:**

.. code-block:: yaml

   # Kubernetes deployment example
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: myapp
   spec:
     replicas: 3
     # apps/v1 requires a selector that matches the pod template labels
     selector:
       matchLabels:
         app: myapp
     strategy:
       type: RollingUpdate
       rollingUpdate:
         maxUnavailable: 1
         maxSurge: 1
     template:
       metadata:
         labels:
           app: myapp
       spec:
         containers:
           - name: app
             image: myapp:v1.2.3
             livenessProbe:
               httpGet:
                 path: /health
                 port: 8080
               initialDelaySeconds: 30
               periodSeconds: 10
             readinessProbe:
               httpGet:
                 path: /ready
                 port: 8080
               initialDelaySeconds: 5
               periodSeconds: 5

**Blue-Green Deployments:**

.. code-block:: bash

   #!/bin/bash
   # Blue-green deployment script
   set -euo pipefail

   NEW_VERSION=$1

   # Deploy new version to green environment
   docker service create --name myapp-green --replicas 3 "myapp:${NEW_VERSION}"

   # Wait for health checks
   until curl -fsS http://green.example.com/health; do
       sleep 5
   done

   # Switch traffic (update load balancer).
   # update_load_balancer_to_green is a placeholder for your own logic.
   update_load_balancer_to_green

   # Remove old blue environment
   docker service rm myapp-blue

   # Swarm services cannot be renamed, so green stays live under its own
   # name; the next deployment targets "myapp-blue" and the roles alternate.

**Backup and Recovery:**

.. code-block:: yaml

   # Backup strategy for stateful services
   services:
     postgres:
       image: postgres:15
       volumes:
         - postgres_data:/var/lib/postgresql/data
       environment:
         - POSTGRES_DB=myapp
         # Supplied from the shell or an .env file, never hard-coded
         - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}

     backup:
       image: postgres:15
       volumes:
         - ./backups:/backups
       environment:
         - PGPASSWORD=${POSTGRES_PASSWORD}
       # pg_dump connects over the network, so no data volume is needed.
       # "$$" stops Compose from interpolating the shell substitution.
       command:
         - sh
         - -c
         - |
           while true; do
             pg_dump -h postgres -U postgres myapp > /backups/backup_$$(date +%Y%m%d_%H%M%S).sql
             sleep 3600
           done

   volumes:
     postgres_data:

============================
Security Scanning Automation
============================

**Implementing Security Gates**

.. code-block:: bash

   #!/bin/bash
   # security-scan.sh
   set -euo pipefail

   IMAGE_NAME=$1

   echo "Running security scan on $IMAGE_NAME..."

   # Vulnerability scan
   trivy image --format json --output scan-results.json "$IMAGE_NAME"

   # Count critical vulnerabilities across all results
   CRITICAL_VULN=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity == "CRITICAL")] | length' scan-results.json)

   if [ "$CRITICAL_VULN" -gt 0 ]; then
       echo "Critical vulnerabilities found. Deployment blocked."
       exit 1
   fi

   # License compliance check
   docker run --rm -v "$(pwd)":/workspace fossa/fossa analyze

   # Secret detection
   docker run --rm -v "$(pwd)":/workspace trufflesecurity/trufflehog:latest filesystem /workspace

   echo "Security scan passed."

**Policy as Code:**

.. code-block:: text

   # Open Policy Agent (OPA) policy, written in Rego
   package docker.security

   deny[msg] {
       input.User == "root"
       msg := "Container cannot run as root user"
   }

   deny[msg] {
       input.Image.tag == "latest"
       msg := "Image must use specific version tags, not 'latest'"
   }

   deny[msg] {
       not input.HealthCheck
       msg := "Container must define a health check"
   }

=====================
Troubleshooting Guide
=====================

**Common Production Issues**

**Container Won't Start:**

.. code-block:: bash

   # Check container logs
   docker logs container_name

   # Run interactively to debug
   docker run -it --entrypoint /bin/sh image_name

   # Check resource constraints
   docker stats
   docker system df

**Performance Issues:**

.. code-block:: bash

   # Monitor resource usage
   docker stats --no-stream

   # Check container processes
   docker exec container_name ps aux

   # Memory analysis
   docker exec container_name cat /proc/meminfo

   # Check for memory leaks
   docker exec container_name pmap -x 1

**Network Issues:**

.. code-block:: bash

   # Test connectivity between containers
   docker exec container1 ping container2

   # Check DNS resolution
   docker exec container_name nslookup service_name

   # Inspect network configuration
   docker network inspect network_name

**Storage Issues:**

.. code-block:: bash

   # Check disk usage
   docker system df

   # Clean up unused resources
   # (-a also removes every image not used by a container)
   docker system prune -a

   # Check volume mounts
   docker inspect container_name | jq '.[].Mounts'

======================
Operational Excellence
======================

**Day-2 Operations Checklist**

**Daily Tasks:**

- Monitor container health and resource usage
- Review security scan results
- Check backup completion
- Monitor application metrics and alerts

**Weekly Tasks:**

- Update base images for security patches
- Review and clean up unused images/containers
- Performance baseline comparison
- Security policy compliance audit

**Monthly Tasks:**

- Disaster recovery testing
- Capacity planning review
- Security training and awareness
- Tool and process optimization

**Automated Monitoring:**

.. code-block:: yaml

   # Comprehensive monitoring stack
   version: '3.8'
   services:
     # Log aggregation
     loki:
       image: grafana/loki:latest
       ports:
         - "3100:3100"

     # Metrics collection
     prometheus:
       image: prom/prometheus:latest
       ports:
         - "9090:9090"

     # Alerting
     alertmanager:
       image: prom/alertmanager:latest
       ports:
         - "9093:9093"

     # Visualization
     grafana:
       image: grafana/grafana:latest
       ports:
         - "3000:3000"

===============
Future-Proofing
===============

**Container Technology Evolution**

**WebAssembly (WASM) Containers:**

- Often smaller and faster to start than traditional containers, with a sandboxed execution model
- Language-agnostic runtime
- Strong isolation and portability across platforms

**Confidential Computing:**

- Hardware-encrypted container execution
- Protection against privileged access attacks
- Secure multi-party computation

**GitOps and Infrastructure as Code:**

- Declarative container configuration
- Version-controlled infrastructure
- Automated drift detection and correction

============
What's Next?
============

You now have the knowledge to run containers securely and efficiently in production. These practices form the foundation for scaling to Kubernetes, implementing microservices architectures, and building robust cloud-native applications.

**Key takeaways:**

- Security is multi-layered and must be built in from the start
- Image optimization reduces costs and improves performance
- Monitoring and logging are essential for production operations
- Automation prevents human error and improves reliability
- Best practices evolve - stay current with the container ecosystem

.. warning::

   **Continuous Learning:** Container technology evolves rapidly. Join the community, follow security advisories, and regularly update your practices. What's secure today may be vulnerable tomorrow.
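
As one illustration of the automation takeaway, the three policy-as-code rules from this section (no root user, no ``latest`` tag, a health check must exist) can be approximated with a small Dockerfile linter. This is a minimal sketch, not a replacement for OPA or a real linter: the function name ``lint_dockerfile`` is hypothetical, line continuations and build arguments are ignored, and only the final build stage is evaluated.

```python
def lint_dockerfile(text: str) -> list[str]:
    """Return policy violations found in a Dockerfile's text (simplified)."""
    problems: list[str] = []
    stage_aliases: set[str] = set()
    last_user = None
    has_healthcheck = False

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].upper()
        tokens = parts[1].split() if len(parts) > 1 else []

        if keyword == "FROM":
            # A new stage starts: USER and HEALTHCHECK must be set again.
            last_user, has_healthcheck = None, False
            refs = [t for t in tokens if not t.startswith("--")]
            if not refs:
                continue
            image = refs[0]
            is_stage = image.lower() in stage_aliases or image == "scratch"
            if not is_stage and "@sha256:" not in image:
                # Look for a tag only after the last "/" so that a
                # registry port (host:5000/app) is not mistaken for one.
                tag = image.rsplit("/", 1)[-1].partition(":")[2]
                if tag in ("", "latest"):
                    problems.append(f"{image}: pin a specific version tag")
            if len(refs) >= 3 and refs[1].upper() == "AS":
                stage_aliases.add(refs[2].lower())
        elif keyword == "USER" and tokens:
            last_user = tokens[0].split(":")[0]
        elif keyword == "HEALTHCHECK" and tokens and tokens[0].upper() != "NONE":
            has_healthcheck = True

    if last_user in (None, "root", "0"):
        problems.append("final stage runs as root; add a non-root USER")
    if not has_healthcheck:
        problems.append("final stage defines no HEALTHCHECK")
    return problems


if __name__ == "__main__":
    sample = "FROM python:latest\nCMD [\"python\", \"app.py\"]\n"
    for problem in lint_dockerfile(sample):
        print(problem)
```

A check like this could run as an early CI step, before the image is built and scanned. A non-empty result would fail the job.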