10.9 Ansible Production Patterns
Learning Objectives
By the end of this chapter, you will be able to:
Design enterprise-scale Ansible architectures for production environments
Integrate Ansible seamlessly with Terraform workflows for complete IaC solutions
Implement robust CI/CD pipelines using Ansible for automated deployments
Apply security best practices and compliance frameworks in Ansible automation
Monitor and observe Ansible automation with logging, metrics, and alerting
Scale Ansible operations across large, distributed infrastructures
Troubleshoot complex production issues and implement reliable error recovery
Establish governance and organizational patterns for team collaboration
Optimize Ansible workflows for performance, reliability, and maintainability
Prerequisites: Mastery of Ansible core concepts (Chapters 10.6-10.8) and understanding of Terraform from previous chapters. Experience with production systems recommended.
Enterprise Focus: This chapter presents real-world, production-ready patterns used in enterprise environments with hundreds or thousands of managed nodes.
Enterprise Ansible Architecture
Production Ansible deployments require careful architectural planning to ensure scalability, reliability, and maintainability across large organizations.
Centralized Control Architecture:
Enterprise Ansible Architecture:
┌─────────────────────────────────────────────────────┐
│ Control Plane │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Ansible │ │ Ansible │ │ Ansible │ │
│ │ Controller │ │ AWX/Tower │ │ Semaphore │ │
│ │ (Primary) │ │ (Secondary)│ │ (Backup) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ┌──────────────────────────────────────────────────┤
│ │ Shared Storage │
│ │ • Playbooks & Roles Repository │
│ │ • Inventory Management │
│ │ • Secrets & Vault Files │
│ │ • Execution Logs & Artifacts │
│ └──────────────────────────────────────────────────│
└─────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────┐
│ Network Layer │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ DMZ │ │ Private │ │ Cloud │ │
│ │ Bastion │ │ Network │ │ Provider │ │
│ │ Hosts │ │ Subnets │ │ APIs │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────┐
│ Managed Nodes │
│ │
│ Production │ Staging │ Development │
│ ┌─────────┐ │ ┌─────────┐ │ ┌─────────┐ │
│ │ Web │ │ │ Web │ │ │ Web │ │
│ │ Servers │ │ │ Servers │ │ │ Servers │ │
│ ├─────────┤ │ ├─────────┤ │ ├─────────┤ │
│ │ DB │ │ │ DB │ │ │ DB │ │
│ │ Servers │ │ │ Servers │ │ │ Servers │ │
│ ├─────────┤ │ ├─────────┤ │ ├─────────┤ │
│ │ Cache │ │ │ Cache │ │ │ Cache │ │
│ │ Servers │ │ │ Servers │ │ │ Servers │ │
│ └─────────┘ │ └─────────┘ │ └─────────┘ │
└─────────────────────────────────────────────────────┘
Enterprise Directory Structure:
ansible-infrastructure/
├── ansible.cfg # Global Ansible configuration
├── requirements.yml # External collections and roles
├── vault-password-files/ # Vault password management
│ ├── production
│ ├── staging
│ └── development
├── inventories/ # Environment-specific inventories
│ ├── production/
│ │ ├── hosts.yml # Static inventory
│ │ ├── gcp.yml # Dynamic GCP inventory
│ │ ├── group_vars/ # Group variables
│ │ └── host_vars/ # Host-specific variables
│ ├── staging/
│ └── development/
├── playbooks/ # Main orchestration playbooks
│ ├── site.yml # Main site playbook
│ ├── deploy.yml # Application deployment
│ ├── maintenance.yml # Maintenance procedures
│ └── disaster-recovery.yml # DR procedures
├── roles/ # Custom roles
│ ├── common/ # Base system configuration
│ ├── security/ # Security hardening
│ ├── monitoring/ # Observability setup
│ ├── database/ # Database management
│ ├── webserver/ # Web server configuration
│ └── application/ # Application deployment
├── collections/ # Local collections
├── filter_plugins/ # Custom filters
├── callback_plugins/ # Custom callbacks
├── library/ # Custom modules
├── scripts/ # Helper scripts
├── tests/ # Automated testing
└── docs/ # Documentation
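The requirements.yml file at the repository root pins external dependencies so that every controller and CI runner resolves identical versions. A minimal sketch (the version constraints and the geerlingguy.ntp role are illustrative):
# requirements.yml (illustrative)
---
collections:
  - name: google.cloud
    version: ">=1.1.0"
  - name: community.general
    version: ">=6.0.0"
  - name: community.mysql
roles:
  - name: geerlingguy.ntp
Install everything with ansible-galaxy install -r requirements.yml (older releases need a separate ansible-galaxy collection install -r pass for the collections).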
Environment-Specific Configuration:
# inventories/production/group_vars/all/main.yml
---
# Environment identification (note: a variable named "environment" shadows an
# Ansible play keyword and triggers a reserved-name warning; some teams prefer
# a name such as "env_name")
environment: production
environment_tier: prod
# Global settings
ansible_user: ansible-prod
ansible_ssh_private_key_file: ~/.ssh/production-key
ansible_become: true
ansible_become_method: sudo
# Networking
dns_servers:
- 8.8.8.8
- 8.8.4.4
ntp_servers:
- time1.google.com
- time2.google.com
# Security settings
security_hardening: true
audit_logging: true
compliance_mode: true
# Monitoring and observability
monitoring_enabled: true
log_aggregation: true
metrics_collection: true
# Application settings
app_environment: production
debug_mode: false
log_level: warn
# Resource limits
max_connections: 1000
memory_limit: "4GB"
disk_threshold: 85
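Because every subsequent play depends on these values, it pays to fail fast when they are missing or inconsistent. A minimal preflight sketch, assuming the variable names from the file above (the allowed tier values are illustrative):
# playbooks/preflight.yml (illustrative)
---
- name: Validate environment configuration before any changes
  hosts: all
  gather_facts: false
  tasks:
    - name: Assert required environment settings are sane
      assert:
        that:
          - environment_tier in ['prod', 'staging', 'dev']
          - dns_servers | length > 0
          - ntp_servers | length > 0
          - disk_threshold | int > 0
          - disk_threshold | int <= 100
          - not (debug_mode | bool and environment_tier == 'prod')
        fail_msg: "Invalid environment configuration on {{ inventory_hostname }}"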
Terraform + Ansible Integration Workflows
Mature IaC implementations combine Terraform for infrastructure provisioning with Ansible for configuration management in a single automated workflow.
Complete Infrastructure + Configuration Workflow:
#!/bin/bash
# deploy-infrastructure.sh - Complete deployment script
set -euo pipefail
# Configuration (anchor all paths to the repository root so the
# cd into the Terraform directory does not break later relative paths)
ENVIRONMENT=${1:-staging}
ROOT_DIR="$(pwd)"
TERRAFORM_DIR="${ROOT_DIR}/terraform/${ENVIRONMENT}"
ANSIBLE_DIR="${ROOT_DIR}/ansible"
echo "Starting deployment for environment: ${ENVIRONMENT}"
# Step 1: Provision infrastructure with Terraform
echo "Provisioning infrastructure..."
cd "${TERRAFORM_DIR}"
terraform init -upgrade
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
# Step 2: Extract Terraform outputs for Ansible
echo "Extracting infrastructure information..."
mkdir -p "${ROOT_DIR}/terraform/outputs"
terraform output -json > "${ROOT_DIR}/terraform/outputs/${ENVIRONMENT}.json"
# Step 3: Generate Ansible inventory from Terraform state
cd "${ANSIBLE_DIR}"
python3 scripts/terraform-to-inventory.py \
  --terraform-output="${ROOT_DIR}/terraform/outputs/${ENVIRONMENT}.json" \
  --output="inventories/${ENVIRONMENT}/terraform.yml"
# Step 4: Wait for instances to be ready
echo "Waiting for instances to be accessible..."
ansible-playbook \
  -i "inventories/${ENVIRONMENT}" \
  playbooks/wait-for-ready.yml \
  --extra-vars "environment=${ENVIRONMENT}"
# Step 5: Configure infrastructure with Ansible
echo "Configuring infrastructure..."
ansible-playbook \
  -i "inventories/${ENVIRONMENT}" \
  playbooks/site.yml \
  --extra-vars "environment=${ENVIRONMENT}" \
  --vault-password-file="vault-password-files/${ENVIRONMENT}"
# Step 6: Deploy applications
echo "Deploying applications..."
ansible-playbook \
  -i "inventories/${ENVIRONMENT}" \
  playbooks/deploy.yml \
  --extra-vars "environment=${ENVIRONMENT}" \
  --vault-password-file="vault-password-files/${ENVIRONMENT}"
# Step 7: Run post-deployment tests
echo "Running post-deployment tests..."
ansible-playbook \
  -i "inventories/${ENVIRONMENT}" \
  playbooks/test.yml \
  --extra-vars "environment=${ENVIRONMENT}"
echo "Deployment completed successfully!"
Terraform Output Integration:
# terraform/outputs.tf
output "ansible_inventory" {
description = "Ansible inventory data"
value = {
web_servers = {
hosts = {
for instance in google_compute_instance.web_servers :
instance.name => {
ansible_host = instance.network_interface[0].access_config[0].nat_ip
internal_ip = instance.network_interface[0].network_ip
zone = instance.zone
instance_id = instance.instance_id
machine_type = instance.machine_type
labels = instance.labels
}
}
vars = {
role = "webserver"
nginx_port = var.nginx_port
ssl_enabled = var.ssl_enabled
backup_enabled = true
}
}
databases = {
hosts = {
for instance in google_compute_instance.databases :
instance.name => {
ansible_host = instance.network_interface[0].access_config[0].nat_ip
internal_ip = instance.network_interface[0].network_ip
zone = instance.zone
instance_id = instance.instance_id
}
}
vars = {
role = "database"
mysql_port = var.mysql_port
backup_schedule = var.backup_schedule
replication_mode = var.replication_mode
}
}
load_balancers = {
hosts = {
for lb in google_compute_instance.load_balancers :
lb.name => {
ansible_host = lb.network_interface[0].access_config[0].nat_ip
internal_ip = lb.network_interface[0].network_ip
}
}
vars = {
role = "loadbalancer"
backend_servers = [
for instance in google_compute_instance.web_servers :
instance.network_interface[0].network_ip
]
}
}
}
}
# Output network information for Ansible templates
output "network_info" {
value = {
vpc_name = google_compute_network.main.name
subnet_cidr = google_compute_subnetwork.main.ip_cidr_range
firewall_rules = [
for rule in google_compute_firewall.rules :
{
name = rule.name
ports = rule.allow[0].ports
ranges = rule.source_ranges
}
]
}
}
Inventory Generation Script:
#!/usr/bin/env python3
# scripts/terraform-to-inventory.py
import json
import yaml
import argparse
from pathlib import Path
def terraform_to_ansible_inventory(terraform_output: dict) -> dict:
    """Convert Terraform output to Ansible static YAML inventory format."""
    inventory = {
        'all': {
            'children': {}
        }
    }
    # Extract Terraform outputs
    tf_inventory = terraform_output.get('ansible_inventory', {}).get('value', {})
    network_info = terraform_output.get('network_info', {}).get('value', {})
    # Process each group from the Terraform output
    for group_name, group_data in tf_inventory.items():
        group_vars = dict(group_data.get('vars', {}))
        # Add network information to group vars
        if network_info:
            group_vars.update({
                'vpc_name': network_info.get('vpc_name'),
                'subnet_cidr': network_info.get('subnet_cidr'),
                'firewall_rules': network_info.get('firewall_rules', []),
            })
        # A static YAML inventory expects hosts as a mapping of
        # hostname -> host variables (a list of names plus a '_meta'
        # section only applies to dynamic inventory scripts)
        inventory['all']['children'][group_name] = {
            'hosts': dict(group_data['hosts']),
            'vars': group_vars,
        }
    return inventory
def main():
parser = argparse.ArgumentParser(description='Convert Terraform output to Ansible inventory')
parser.add_argument('--terraform-output', required=True, help='Path to Terraform JSON output file')
parser.add_argument('--output', required=True, help='Path to output Ansible inventory file')
args = parser.parse_args()
# Read Terraform output
with open(args.terraform_output, 'r') as f:
terraform_data = json.load(f)
# Convert to Ansible inventory
inventory = terraform_to_ansible_inventory(terraform_data)
# Write Ansible inventory
output_path = Path(args.output)
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, 'w') as f:
yaml.dump(inventory, f, default_flow_style=False, sort_keys=False)
print(f"✅ Generated Ansible inventory: {output_path}")
if __name__ == '__main__':
main()
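Run against a real outputs file, the script writes a static YAML inventory that ansible-playbook can consume directly via -i. An abbreviated, illustrative result (hostnames and addresses invented):
# inventories/staging/terraform.yml (illustrative output)
all:
  children:
    web_servers:
      hosts:
        web-server-0:
          ansible_host: 203.0.113.10
          internal_ip: 10.0.1.10
          zone: us-central1-a
      vars:
        role: webserver
        vpc_name: main-vpc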
Integration Playbook:
---
# playbooks/terraform-integration.yml
- name: Terraform + Ansible Integration
hosts: localhost
gather_facts: false
vars:
terraform_dir: "../terraform/{{ environment }}"
tasks:
- name: Check Terraform state
stat:
path: "{{ terraform_dir }}/terraform.tfstate"
register: tf_state
- name: Fail if Terraform state not found
fail:
msg: "Terraform state not found. Run terraform apply first."
when: not tf_state.stat.exists
    - name: Get Terraform outputs
      command: terraform output -json
      args:
        chdir: "{{ terraform_dir }}"
      register: tf_outputs
      changed_when: false
- name: Parse Terraform outputs
set_fact:
terraform_data: "{{ tf_outputs.stdout | from_json }}"
- name: Display infrastructure summary
debug:
msg: |
Infrastructure Summary:
- Web Servers: {{ terraform_data.ansible_inventory.value.web_servers.hosts | length }}
- Databases: {{ terraform_data.ansible_inventory.value.databases.hosts | length }}
- Load Balancers: {{ terraform_data.ansible_inventory.value.load_balancers.hosts | length }}
- VPC: {{ terraform_data.network_info.value.vpc_name }}
- name: Wait for all instances to be accessible
wait_for:
host: "{{ hostvars[item]['ansible_host'] }}"
port: 22
timeout: 300
loop: "{{ groups['all'] }}"
when: hostvars[item]['ansible_host'] is defined
CI/CD Pipeline Integration
Modern DevOps requires Ansible to integrate seamlessly with continuous integration and deployment pipelines.
GitHub Actions Workflow:
# .github/workflows/infrastructure-deployment.yml
name: Infrastructure Deployment
on:
push:
branches: [main]
paths:
- 'terraform/**'
- 'ansible/**'
pull_request:
branches: [main]
paths:
- 'terraform/**'
- 'ansible/**'
env:
ENVIRONMENT: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}
jobs:
validate:
name: Validate Infrastructure Code
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.5.0
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.9'
- name: Install Ansible
run: |
pip install ansible google-auth requests
ansible-galaxy collection install google.cloud community.general
- name: Validate Terraform
run: |
cd terraform/${{ env.ENVIRONMENT }}
terraform init -backend=false
terraform validate
terraform fmt -check
- name: Lint Ansible
run: |
cd ansible
          ansible-lint playbooks/site.yml
          ansible-playbook --syntax-check playbooks/site.yml
- name: Test Ansible roles
run: |
cd ansible
molecule test
deploy-staging:
name: Deploy to Staging
runs-on: ubuntu-latest
needs: validate
if: github.event_name == 'pull_request'
environment: staging
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Configure GCP credentials
uses: google-github-actions/auth@v1
with:
credentials_json: ${{ secrets.GCP_SERVICE_ACCOUNT_KEY }}
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.5.0
      - name: Deploy infrastructure
        run: |
          cd terraform/staging
          terraform init
          terraform plan -out=tfplan
          terraform apply tfplan
- name: Setup Ansible
run: |
pip install ansible google-auth
ansible-galaxy collection install google.cloud
- name: Configure infrastructure
run: |
cd ansible
ansible-playbook \
-i inventories/staging \
            playbooks/site.yml \
--vault-password-file <(echo "${{ secrets.ANSIBLE_VAULT_PASSWORD }}")
- name: Run integration tests
run: |
cd ansible
ansible-playbook \
-i inventories/staging \
playbooks/test.yml
deploy-production:
name: Deploy to Production
runs-on: ubuntu-latest
needs: validate
if: github.ref == 'refs/heads/main'
environment: production
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Configure GCP credentials
uses: google-github-actions/auth@v1
with:
credentials_json: ${{ secrets.GCP_SERVICE_ACCOUNT_KEY }}
      - name: Setup Ansible
        run: |
          pip install ansible google-auth
          ansible-galaxy collection install google.cloud
      - name: Deploy with approval
        run: |
          # the protected "production" environment enforces manual approval
          ./scripts/deploy-infrastructure.sh production
        env:
          VAULT_PASSWORD: ${{ secrets.ANSIBLE_VAULT_PASSWORD_PROD }}
GitLab CI/CD Pipeline:
# .gitlab-ci.yml
stages:
- validate
- test
- deploy-staging
- deploy-production
variables:
TERRAFORM_VERSION: "1.5.0"
ANSIBLE_VERSION: "6.0.0"
before_script:
- apt-get update -qq && apt-get install -y -qq python3-pip
- pip3 install ansible==$ANSIBLE_VERSION google-auth requests
- ansible-galaxy collection install google.cloud community.general
validate-terraform:
  stage: validate
  image:
    name: hashicorp/terraform:$TERRAFORM_VERSION
    entrypoint: [""]  # the image's terraform entrypoint would swallow the script
  before_script: []   # this image has no apt-get/pip; skip the global Ansible setup
  script:
    - cd terraform/staging
    - terraform init -backend=false
    - terraform validate
    - terraform fmt -check
only:
changes:
- terraform/**/*
validate-ansible:
stage: validate
image: python:3.9
script:
- cd ansible
    - ansible-lint playbooks/site.yml
    - ansible-playbook --syntax-check playbooks/site.yml
only:
changes:
- ansible/**/*
test-ansible-roles:
stage: test
image: python:3.9
services:
- docker:dind
script:
- pip3 install molecule[docker] docker
- cd ansible
- molecule test
only:
changes:
- ansible/roles/**/*
deploy-staging:
  stage: deploy-staging
  # the deploy script needs both terraform and ansible; the global
  # before_script installs ansible, but terraform must be added to this
  # image (or bake a custom image with both preinstalled)
  image: python:3.9
environment:
name: staging
url: https://staging.example.com
script:
- ./scripts/deploy-infrastructure.sh staging
only:
- merge_requests
when: manual
deploy-production:
stage: deploy-production
image: python:3.9
environment:
name: production
url: https://example.com
script:
- ./scripts/deploy-infrastructure.sh production
only:
- main
when: manual
Testing Integration:
# ansible/playbooks/test.yml
---
- name: Post-deployment testing
hosts: all
gather_facts: true
tasks:
- name: Test SSH connectivity
ping:
tags: [connectivity]
    - name: Verify required services are running
      systemd:
        name: "{{ item }}"
        state: started
      check_mode: yes
      register: service_status
      # in check mode, a reported change means the service was not running
      failed_when: service_status.changed
      loop:
        - nginx
        - mysql
        - redis
      tags: [services]
- name: Test web server response
uri:
url: "https://{{ ansible_host }}"
method: GET
status_code: 200
timeout: 10
delegate_to: localhost
when: "'web_servers' in group_names"
tags: [web]
- name: Test database connectivity
mysql_db:
name: "{{ app_database_name }}"
state: present
check_mode: yes
when: "'databases' in group_names"
tags: [database]
    # openssl_certificate's assertonly provider has been removed;
    # x509_certificate_info plus assert is the current replacement
    - name: Gather SSL certificate facts
      community.crypto.x509_certificate_info:
        path: "/etc/ssl/certs/{{ domain_name }}.crt"
        valid_at:
          in_30_days: "+30d"
      register: cert_info
      when: ssl_enabled | default(false)
      tags: [ssl]
    - name: Verify SSL certificates remain valid for at least 30 days
      assert:
        that:
          - cert_info.valid_at.in_30_days
      when: ssl_enabled | default(false)
      tags: [ssl]
Security and Compliance
Production Ansible deployments must implement comprehensive security and compliance measures.
Security Hardening Role Tasks:
---
# roles/security/tasks/main.yml
# A role tasks file contains only tasks: play-level keywords such as hosts
# and become belong in the calling playbook, and defaults such as
# compliance_standard ("CIS", "SOC2", "PCI-DSS") in defaults/main.yml
- name: Include OS-specific security tasks
  include_tasks: "{{ ansible_os_family | lower }}.yml"
- name: Configure SSH hardening
  include_tasks: ssh-hardening.yml
  tags: [ssh, security]
- name: Configure firewall rules
  include_tasks: firewall.yml
  tags: [firewall, security]
- name: Configure audit logging
  include_tasks: audit.yml
  tags: [audit, compliance]
- name: Install security tools
  include_tasks: security-tools.yml
  tags: [tools, security]
- name: Configure file permissions
  include_tasks: file-permissions.yml
  tags: [permissions, security]
SSH Hardening Tasks:
# roles/security/tasks/ssh-hardening.yml
---
- name: Configure secure SSH settings
lineinfile:
path: /etc/ssh/sshd_config
regexp: "{{ item.regexp }}"
line: "{{ item.line }}"
backup: yes
loop:
- { regexp: '^#?PermitRootLogin', line: 'PermitRootLogin no' }
- { regexp: '^#?PasswordAuthentication', line: 'PasswordAuthentication no' }
- { regexp: '^#?PubkeyAuthentication', line: 'PubkeyAuthentication yes' }
    # ('Protocol 2' intentionally omitted: protocol 1 was removed in
    # OpenSSH 7.6 and the option is now ignored)
- { regexp: '^#?X11Forwarding', line: 'X11Forwarding no' }
- { regexp: '^#?MaxAuthTries', line: 'MaxAuthTries 3' }
- { regexp: '^#?ClientAliveInterval', line: 'ClientAliveInterval 300' }
- { regexp: '^#?ClientAliveCountMax', line: 'ClientAliveCountMax 2' }
- { regexp: '^#?LoginGraceTime', line: 'LoginGraceTime 60' }
notify: restart sshd
- name: Configure SSH allowed users
lineinfile:
path: /etc/ssh/sshd_config
line: "AllowUsers {{ ssh_allowed_users | join(' ') }}"
regexp: '^AllowUsers'
when: ssh_allowed_users is defined
notify: restart sshd
- name: Configure SSH key algorithms
blockinfile:
path: /etc/ssh/sshd_config
block: |
# Strong crypto
KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group14-sha256,diffie-hellman-group16-sha512
Ciphers aes256-gcm@openssh.com,chacha20-poly1305@openssh.com,aes256-ctr
MACs hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha2-256,hmac-sha2-512
marker: "# {mark} ANSIBLE MANAGED BLOCK - SSH CRYPTO"
notify: restart sshd
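These tasks notify a restart sshd handler that is not shown above. A minimal sketch of the role's handler file, validating the configuration before restarting so a bad change cannot lock everyone out:
# roles/security/handlers/main.yml (illustrative)
---
- name: validate sshd config
  command: sshd -t
  changed_when: false
  listen: restart sshd
- name: restart sshd service
  systemd:
    name: sshd   # the unit is named "ssh" on Debian/Ubuntu
    state: restarted
  listen: restart sshd
Using listen lets both handlers respond to the single notify topic, and handlers run in the order they are defined, so the validation always precedes the restart.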
Compliance Reporting:
# roles/security/tasks/compliance-report.yml
---
- name: Generate compliance report
  block:
    - name: Check SSH configuration compliance
      shell: |
        sshd -T | grep -E "(permitrootlogin|passwordauthentication|maxauthtries)"
      register: ssh_compliance
      changed_when: false
    - name: Check firewall status
      systemd:
        name: ufw
        state: started
      check_mode: yes
      register: firewall_status
      failed_when: false
    - name: Check audit daemon
      systemd:
        name: auditd
        state: started
      check_mode: yes
      register: audit_status
      failed_when: false
    # build the report as a fact so the template and upload tasks below can
    # both see it; task-level vars are scoped to a single task only
    - name: Assemble compliance data
      set_fact:
        compliance_data:
          hostname: "{{ inventory_hostname }}"
          timestamp: "{{ ansible_date_time.iso8601 }}"
          ssh_config: "{{ ssh_compliance.stdout_lines }}"
          # in check mode, a reported change means the unit was not running
          firewall_enabled: "{{ not (firewall_status.changed | default(true)) }}"
          audit_enabled: "{{ not (audit_status.changed | default(true)) }}"
          os_info:
            distribution: "{{ ansible_distribution }}"
            version: "{{ ansible_distribution_version }}"
            kernel: "{{ ansible_kernel }}"
    - name: Write compliance report
      template:
        src: compliance-report.j2
        dest: "/var/log/compliance-{{ ansible_date_time.iso8601_basic_short }}.json"
    - name: Upload compliance report
      uri:
        url: "{{ compliance_reporting_url }}"
        method: POST
        headers:
          Authorization: "Bearer {{ compliance_api_token }}"
        body_format: json
        body: "{{ compliance_data }}"
      when: compliance_reporting_url is defined
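The compliance-report.j2 template referenced above is not shown elsewhere; because compliance_data is already a structured fact, a one-line sketch that serializes it as JSON is sufficient:
{# roles/security/templates/compliance-report.j2 (illustrative) #}
{{ compliance_data | to_nice_json }}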
Secrets Management with Vault:
# group_vars/production/vault.yml (contents shown decrypted for illustration)
---
vault_database_credentials:
master_password: "super_secure_master_password"
replication_password: "replication_password_123"
backup_password: "backup_password_456"
vault_ssl_certificates:
private_key: |
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQDXyZ...
-----END PRIVATE KEY-----
certificate: |
-----BEGIN CERTIFICATE-----
MIIDXTCCAkWgAwIBAgIJAKoK/heBjcOuMA0GCSqGSIb3DQEBBQUAMEUx...
-----END CERTIFICATE-----
vault_api_keys:
monitoring: "mon_1234567890abcdef"
logging: "log_abcdef1234567890"
backup: "bak_567890abcdef1234"
vault_service_accounts:
gcp_service_account: |
{
"type": "service_account",
"project_id": "my-project",
"private_key_id": "key123",
"private_key": "-----BEGIN PRIVATE KEY-----\n...",
"client_email": "service@my-project.iam.gserviceaccount.com",
"client_id": "123456789",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token"
}
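Vault files like this stay encrypted at rest and decrypt transparently at runtime; the values are then referenced like any other variable, for example {{ vault_database_credentials.master_password }}. Typical management commands, using the password files from the repository layout above (the "production-new" file name is illustrative):
# Encrypt the production vault file
ansible-vault encrypt group_vars/production/vault.yml \
  --vault-password-file vault-password-files/production
# Edit it in place without leaving plaintext on disk
ansible-vault edit group_vars/production/vault.yml \
  --vault-password-file vault-password-files/production
# Rotate the vault password
ansible-vault rekey group_vars/production/vault.yml \
  --vault-password-file vault-password-files/production \
  --new-vault-password-file vault-password-files/production-new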
Monitoring and Observability
Production Ansible environments require comprehensive monitoring and observability to ensure reliability and performance.
Ansible Callback Plugin for Monitoring:
# callback_plugins/monitoring.py
from ansible.plugins.callback import CallbackBase
import json
import os
import requests
import time
from datetime import datetime
class CallbackModule(CallbackBase):
"""Send Ansible execution metrics to monitoring system"""
CALLBACK_VERSION = 2.0
CALLBACK_TYPE = 'notification'
CALLBACK_NAME = 'monitoring'
    def __init__(self):
        super(CallbackModule, self).__init__()
        self.start_time = None
        self.playbook_name = None
        # read the endpoint and credentials from the environment instead of
        # hardcoding secrets in the plugin source
        self.monitoring_url = os.environ.get(
            'ANSIBLE_MONITORING_URL',
            'https://monitoring.example.com/api/ansible')
        self.api_token = os.environ.get('ANSIBLE_MONITORING_TOKEN', '')
def v2_playbook_on_start(self, playbook):
self.start_time = time.time()
self.playbook_name = playbook._file_name
# Send playbook start event
self._send_metric({
'event': 'playbook_started',
'playbook': self.playbook_name,
'timestamp': datetime.utcnow().isoformat(),
'environment': self._get_var('environment')
})
def v2_playbook_on_stats(self, stats):
end_time = time.time()
duration = end_time - self.start_time
# Collect execution statistics
summary = {}
for host in stats.processed:
summary[host] = stats.summarize(host)
# Send completion metrics
self._send_metric({
'event': 'playbook_completed',
'playbook': self.playbook_name,
'duration': duration,
'summary': summary,
'timestamp': datetime.utcnow().isoformat(),
'environment': self._get_var('environment')
})
def v2_runner_on_failed(self, result, ignore_errors=False):
# Send failure alerts
self._send_metric({
'event': 'task_failed',
'playbook': self.playbook_name,
'host': result._host.get_name(),
'task': result._task.get_name(),
'error': str(result._result.get('msg', 'Unknown error')),
'timestamp': datetime.utcnow().isoformat(),
'environment': self._get_var('environment')
})
def _send_metric(self, data):
try:
requests.post(
self.monitoring_url,
json=data,
headers={'Authorization': f'Bearer {self.api_token}'},
timeout=5
)
except Exception as e:
self._display.warning(f"Failed to send monitoring data: {e}")
    def _get_var(self, var_name):
        # callback plugins have no direct access to play variables; fall back
        # to an environment variable (e.g. ENVIRONMENT=production)
        return os.environ.get(var_name.upper(), 'unknown')
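Because this is a notification callback, Ansible only fires it once it is explicitly enabled. A sketch of the relevant ansible.cfg entries (the environment variable names match the plugin above):
# ansible.cfg (excerpt)
[defaults]
callback_plugins = ./callback_plugins
callbacks_enabled = monitoring   # callback_whitelist on ansible-core < 2.11
The plugin then reads ANSIBLE_MONITORING_URL and ANSIBLE_MONITORING_TOKEN from the environment, so credentials never live in the repository.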
Logging Configuration:
# roles/monitoring/tasks/logging.yml
---
- name: Configure centralized logging
block:
- name: Install logging agent
package:
name: "{{ logging_agent_package }}"
state: present
- name: Configure log forwarding
template:
src: "{{ logging_config_template }}"
dest: "{{ logging_config_path }}"
backup: yes
notify: restart logging agent
      vars:
        log_server: "{{ centralized_log_server }}"
        log_port: "{{ centralized_log_port }}"
        # do not redefine "environment" here: "{{ environment }}" would
        # reference itself recursively, and the group_vars value is
        # already visible to the template
        service_name: "{{ inventory_hostname }}"
- name: Configure Ansible log rotation
template:
src: ansible-logrotate.j2
dest: /etc/logrotate.d/ansible
vars:
log_retention_days: 30
max_log_size: "100M"
- name: Create Ansible log directory
file:
path: /var/log/ansible
state: directory
owner: ansible
group: ansible
mode: '0750'
- name: Configure Ansible logging
lineinfile:
path: /etc/ansible/ansible.cfg
regexp: '^#?log_path'
line: 'log_path = /var/log/ansible/ansible.log'
create: yes
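The ansible-logrotate.j2 template referenced above could be as simple as the following sketch (standard logrotate directives; the two variables come from the task's vars):
{# roles/monitoring/templates/ansible-logrotate.j2 (illustrative) #}
/var/log/ansible/*.log {
    daily
    rotate {{ log_retention_days }}
    maxsize {{ max_log_size }}
    compress
    delaycompress
    missingok
    notifempty
    create 0640 ansible ansible
}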
Performance Monitoring Playbook:
# playbooks/performance-monitoring.yml
---
- name: Monitor Ansible performance
hosts: localhost
  gather_facts: true   # ansible_date_time below requires gathered facts
tasks:
- name: Start performance monitoring
debug:
msg: "Starting performance monitoring for {{ ansible_play_name }}"
- name: Record start time
set_fact:
monitoring_start_time: "{{ ansible_date_time.epoch }}"
- name: Monitor system resources during execution
shell: |
top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1
register: cpu_usage
changed_when: false
- name: Monitor memory usage
shell: |
free | grep Mem | awk '{printf "%.2f", ($3/$2) * 100.0}'
register: memory_usage
changed_when: false
- name: Record resource usage
set_fact:
resource_metrics:
cpu_usage: "{{ cpu_usage.stdout }}"
memory_usage: "{{ memory_usage.stdout }}"
timestamp: "{{ ansible_date_time.iso8601 }}"
- name: Send metrics to monitoring system
uri:
url: "{{ monitoring_api_url }}/metrics"
method: POST
headers:
Authorization: "Bearer {{ monitoring_api_token }}"
body_format: json
body:
service: "ansible"
environment: "{{ environment }}"
metrics: "{{ resource_metrics }}"
when: monitoring_api_url is defined
Scaling and Performance
Large-scale Ansible deployments require specific patterns and optimizations for performance and reliability.
High-Performance Ansible Configuration:
# ansible.cfg for production scale
[defaults]
# Increase parallel execution
forks = 100
# Optimize SSH connections (ssh_args belongs in [ssh_connection] below;
# host_key_checking=False trades security for convenience at scale)
host_key_checking = False
pipelining = True
# Performance optimizations
gathering = smart
# Redis-backed fact cache; requires the "redis" Python package on the controller
fact_caching = redis
fact_caching_connection = redis-cluster.example.com:6379:0
fact_caching_timeout = 86400
# Reduce output verbosity in production
stdout_callback = minimal
bin_ansible_callbacks = True
# Connection settings
timeout = 30
# Retry settings
retry_files_enabled = True
retry_files_save_path = ~/.ansible-retry
[ssh_connection]
# SSH multiplexing
ssh_args = -o ControlMaster=auto -o ControlPersist=600s -o UserKnownHostsFile=/dev/null
control_path_dir = ~/.ansible/cp
control_path = %(directory)s/%%h-%%p-%%r
# Connection pooling
ssh_executable = /usr/bin/ssh
scp_if_ssh = smart
transfer_method = smart
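Measure before and after tuning rather than trusting defaults. The profile_tasks callback from the ansible.posix collection (installed separately via ansible-galaxy) prints per-task timing summaries, which makes regressions obvious:
# ansible.cfg (excerpt) - per-task timing summaries
[defaults]
callbacks_enabled = ansible.posix.profile_tasks
A coarse end-to-end check such as time ansible all -i inventories/production -m ping, run before and after a forks or pipelining change, quantifies the connection overhead saved.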
Scaling Patterns:
# playbooks/scaled-deployment.yml
---
- name: Scaled deployment with batching
hosts: web_servers
serial: "25%" # Deploy to 25% of hosts at a time
max_fail_percentage: 10 # Allow 10% failure rate
pre_tasks:
- name: Remove host from load balancer
uri:
url: "{{ load_balancer_api }}/remove/{{ inventory_hostname }}"
method: POST
delegate_to: localhost
tasks:
- name: Deploy application
include_role:
name: application
vars:
app_version: "{{ new_app_version }}"
- name: Verify deployment
uri:
url: "http://{{ ansible_default_ipv4.address }}:8080/health"
status_code: 200
retries: 5
delay: 10
post_tasks:
- name: Add host back to load balancer
uri:
url: "{{ load_balancer_api }}/add/{{ inventory_hostname }}"
method: POST
delegate_to: localhost
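serial also accepts a list of batch sizes, which yields a canary pattern: one host first, then progressively larger waves once the canary succeeds. A sketch of the play header:
# Canary rollout: 1 host, then 10% of the group, then everything remaining
- name: Canary deployment
  hosts: web_servers
  serial:
    - 1
    - "10%"
    - "100%"
  max_fail_percentage: 0   # any failure aborts the remaining batches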
Async Operations for Scale:
- name: Large-scale async operations
hosts: all
tasks:
- name: Start large file downloads asynchronously
get_url:
url: "{{ item.url }}"
dest: "{{ item.dest }}"
async: 1800 # 30 minutes timeout
poll: 0 # Fire and forget
loop: "{{ large_files }}"
register: download_jobs
- name: Continue with other tasks while downloads happen
package:
name: "{{ required_packages }}"
state: present
- name: Check download progress
async_status:
jid: "{{ item.ansible_job_id }}"
register: download_results
until: download_results.finished
retries: 180
delay: 10
loop: "{{ download_jobs.results }}"
Disaster Recovery and Backup
Production environments require comprehensive disaster recovery and backup strategies implemented through Ansible.
Backup Automation:
# roles/backup/tasks/main.yml
---
- name: Create backup directories
file:
path: "{{ item }}"
state: directory
owner: backup
group: backup
mode: '0750'
loop:
- /backup/database
- /backup/application
- /backup/system
- name: Database backup
mysql_db:
name: "{{ item }}"
state: dump
target: "/backup/database/{{ item }}-{{ ansible_date_time.date }}.sql"
loop: "{{ databases_to_backup }}"
when: "'databases' in group_names"
- name: Application data backup
archive:
path: "{{ app_data_path }}"
dest: "/backup/application/app-data-{{ ansible_date_time.date }}.tar.gz"
remove: false
when: "'web_servers' in group_names"
- name: System configuration backup
archive:
path:
- /etc
- /var/lib
dest: "/backup/system/system-config-{{ ansible_date_time.date }}.tar.gz"
exclude_path:
- /etc/shadow
- /etc/passwd
remove: false
- name: Find backup artifacts to upload
  find:
    paths:
      - /backup/database
      - /backup/application
      - /backup/system
    patterns: "*.sql,*.tar.gz"
  register: backup_files
- name: Upload backups to cloud storage
  # the module uploads one file per call, so loop over the find results;
  # the project/auth variables are expected to come from group_vars
  google.cloud.gcp_storage_object:
    bucket: "{{ backup_bucket }}"
    src: "{{ item.path }}"
    dest: "{{ inventory_hostname }}/{{ item.path | basename }}"
    action: upload
    project: "{{ gcp_project }}"
    auth_kind: serviceaccount
    service_account_file: "{{ gcp_service_account_file }}"
  loop: "{{ backup_files.files }}"
Disaster Recovery Playbook:
# playbooks/disaster-recovery.yml
---
- name: Disaster recovery procedures
hosts: localhost
gather_facts: false
vars_prompt:
- name: recovery_type
prompt: "Recovery type (full|partial|data-only)"
private: false
default: "partial"
- name: backup_date
prompt: "Backup date to restore (YYYY-MM-DD)"
private: false
tasks:
- name: Validate recovery parameters
assert:
that:
- recovery_type in ['full', 'partial', 'data-only']
- backup_date | regex_search('^\d{4}-\d{2}-\d{2}$')
fail_msg: "Invalid recovery parameters"
    # bare "include" was removed from ansible-core; include_tasks replaces it
    - name: Provision new infrastructure (full recovery)
      include_tasks: ../terraform/emergency-provision.yml
      when: recovery_type == 'full'
- name: Restore from backups
include_tasks: restore-from-backup.yml
vars:
restore_date: "{{ backup_date }}"
- name: Verify system functionality
include_tasks: verify-recovery.yml
- name: Update DNS and load balancer
include_tasks: update-traffic-routing.yml
when: recovery_type == 'full'
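The restore-from-backup.yml tasks file referenced above is not shown; a minimal sketch of its database portion, mirroring the backup role (the bucket, auth setup, and variable names follow the earlier examples, and the work is delegated to the first database host):
# playbooks/restore-from-backup.yml (illustrative, database portion)
- name: Restore databases on the first database host
  delegate_to: "{{ groups['databases'] | first }}"
  block:
    - name: Download database dumps from cloud storage
      google.cloud.gcp_storage_object:
        bucket: "{{ backup_bucket }}"
        src: "{{ groups['databases'] | first }}/{{ item }}-{{ restore_date }}.sql"
        dest: "/restore/{{ item }}-{{ restore_date }}.sql"
        action: download
      loop: "{{ databases_to_backup }}"
    - name: Import each dump into MySQL
      mysql_db:
        name: "{{ item }}"
        state: import
        target: "/restore/{{ item }}-{{ restore_date }}.sql"
      loop: "{{ databases_to_backup }}"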
Note
The patterns and practices in this chapter represent years of production experience with Infrastructure as Code. While the examples use Google Cloud Platform, the principles apply to any cloud provider or hybrid environment. Focus on understanding the patterns rather than memorizing specific syntax.