OS patching is the least glamorous job in operations and one of the most important: many real-world compromises exploit vulnerabilities that had fixes available for months. Done by hand, patching fifty servers is an error-prone weekend. Done with Ansible, it's one command, a controlled rollout, and a report. Here's a battle-tested pattern.
The requirements of safe patching
- Batching — never patch everything at once; a bad kernel should hurt one batch, not the fleet.
- Pre-checks — confirm disk space, snapshot/backup status, and service health before touching a host.
- Controlled reboots — reboot only when required, and wait for the host to come back.
- Post-checks — verify services after reboot; stop the rollout if a batch fails.
- Evidence — a record of what was patched, when, for the change ticket.
The playbook skeleton
---
- name: Patch Linux fleet
  hosts: linux_patching
  become: true
  serial: "25%"             # rolling batches: quarter of the group at a time
  max_fail_percentage: 0    # any failure halts the rollout

  pre_tasks:
    - name: Verify at least 2GB free on /
      ansible.builtin.assert:
        that: item.size_available > 2147483648
      loop: "{{ ansible_mounts | selectattr('mount','equalto','/') | list }}"

    - name: Drain node from load balancer  # site-specific: API call, NSG tag, etc.
      ansible.builtin.debug:
        msg: "hook: remove {{ inventory_hostname }} from LB pool"

  tasks:
    - name: Apply all security and bugfix updates (Debian/Ubuntu)
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true
        autoremove: true
      when: ansible_os_family == "Debian"
      register: apt_result

    - name: Apply updates (RHEL/Oracle Linux)
      ansible.builtin.dnf:
        name: "*"
        state: latest
      when: ansible_os_family == "RedHat"

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_flag
      when: ansible_os_family == "Debian"

    - name: Reboot if required
      ansible.builtin.reboot:
        reboot_timeout: 900
        post_reboot_delay: 30
      when: reboot_flag.stat.exists | default(false)

  post_tasks:
    - name: Verify critical services are running
      ansible.builtin.service_facts:

    - name: Assert nginx is active
      ansible.builtin.assert:
        that: ansible_facts.services['nginx.service'].state == 'running'
      when: "'web' in group_names"

    - name: Re-enable node in load balancer
      ansible.builtin.debug:
        msg: "hook: return {{ inventory_hostname }} to LB pool"
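One gap worth noting: the skeleton only checks the Debian reboot flag, so RHEL-family hosts would be updated but never rebooted. A minimal sketch of an equivalent check follows, to sit in the tasks section alongside the Debian one. It assumes the needs-restarting command (shipped in yum-utils or dnf-utils, depending on the release) is installed on the host; the variable name needs_restarting is illustrative.

```yaml
# Sketch only: assumes `needs-restarting` is installed on RHEL-family hosts.
# `needs-restarting -r` exits 0 when no reboot is needed and 1 when one is.
- name: Check if reboot is required (RHEL/Oracle Linux)
  ansible.builtin.command: needs-restarting -r
  register: needs_restarting          # illustrative variable name
  changed_when: false                 # read-only check, never reports "changed"
  failed_when: needs_restarting.rc not in [0, 1]
  when: ansible_os_family == "RedHat"

- name: Reboot RHEL-family host if required
  ansible.builtin.reboot:
    reboot_timeout: 900
    post_reboot_delay: 30
  when:
    - ansible_os_family == "RedHat"
    - needs_restarting.rc | default(0) == 1
```

Treating exit code 1 as a normal result rather than a failure is the important detail; without failed_when, the check itself would trip the circuit breaker.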
The three lines that make it production-grade
serial: "25%" turns a big-bang change into a rolling one.
Ansible completes the whole play — patch, reboot, verify — for one batch before starting the
next. For clustered systems (database replicas, HA pairs), use explicit ordered groups
instead of a percentage so passive nodes always go first.
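One way to express that ordering is a playbook with one play per role, run in sequence. The sketch below assumes two inventory groups, db_replicas and db_primary, and a shared task file named patch_tasks.yml; all three names are hypothetical.

```yaml
---
# Sketch only: group names and patch_tasks.yml are hypothetical.
- name: Patch passive replicas first, one host at a time
  hosts: db_replicas
  become: true
  serial: 1
  max_fail_percentage: 0
  tasks:
    - name: Run the shared patching tasks
      ansible.builtin.import_tasks: patch_tasks.yml

- name: Patch the active primary last
  hosts: db_primary
  become: true
  serial: 1
  max_fail_percentage: 0
  tasks:
    - name: Run the shared patching tasks
      ansible.builtin.import_tasks: patch_tasks.yml
```

Plays run in the order they appear in the file, so the primary is only reached after every replica has gone through patch, reboot, and verification.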
max_fail_percentage: 0 is your circuit breaker. If any host in
a batch fails a task (a pre-check, the update itself, the reboot, or a post-check), the
rollout stops with most of the fleet untouched, which is exactly the state you want to debug from.
The reboot module quietly does what used to need fragile
scripting: issues the reboot, waits for SSH to return, and fails clearly if the host doesn't
come back within the timeout.
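The module also returns a small result that is useful as evidence. A hedged sketch of capturing it is below; reboot_result is an illustrative variable name, and the when condition mirrors the Debian check from the skeleton.

```yaml
# Sketch only: reboot_result is an illustrative variable name.
- name: Reboot if required
  ansible.builtin.reboot:
    msg: "Rebooting for OS patching"   # message shown to logged-in users
    reboot_timeout: 900                # seconds to wait for the host to return
    post_reboot_delay: 30              # settle time before the connection test
  register: reboot_result
  when: reboot_flag.stat.exists | default(false)

- name: Record whether the host rebooted and how long it took
  ansible.builtin.debug:
    msg: >-
      rebooted={{ reboot_result.rebooted | default(false) }}
      elapsed={{ reboot_result.elapsed | default(0) }}s
```

The default filters keep the debug task safe on hosts where the reboot was skipped.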
Rollback thinking
Package managers can downgrade, but the honest rollback strategy for OS patching is restore from snapshot. On cloud platforms, take a boot-volume backup or snapshot as a pre-task (a policy-driven backup shortly before the window also works). Pair that with batching, and your worst case is restoring a handful of machines — not rebuilding a fleet.
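Because the snapshot call is platform-specific, a neutral way to sketch it is a pre-task that runs a wrapper on the control node. The script path below is hypothetical and stands in for whatever your cloud CLI, API, or collection module provides.

```yaml
# Sketch only: /usr/local/bin/snapshot-boot-volume is a hypothetical
# site-specific wrapper; replace it with your platform's snapshot call.
- name: Snapshot boot volume before patching
  ansible.builtin.command:
    argv:
      - /usr/local/bin/snapshot-boot-volume
      - "{{ inventory_hostname }}"
  delegate_to: localhost   # run on the control node, not the target host
  become: false            # no privilege escalation needed on the control node
  changed_when: true       # a new snapshot is always a change
```

Placed first in pre_tasks, a failed snapshot stops that host before any package is touched.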
Reporting for the change ticket
Register the update results and write a simple summary per host: for example,
{{ apt_result.stdout_lines | default([]) }} written to a file on the control node,
or a templated report. Attach it to the change request and your patching evidence
writes itself every cycle.
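A minimal sketch of that report, meant to be appended to post_tasks, might look like the following. The patch-reports directory name is an assumption, and only the Debian result is included because the skeleton registers apt_result but not the dnf task.

```yaml
# Sketch only: the patch-reports directory name is an assumption.
- name: Ensure the report directory exists on the control node
  ansible.builtin.file:
    path: "{{ playbook_dir }}/patch-reports"
    state: directory
    mode: "0755"
  delegate_to: localhost
  become: false
  run_once: true

- name: Write a per-host patch summary on the control node
  ansible.builtin.copy:
    dest: "{{ playbook_dir }}/patch-reports/{{ inventory_hostname }}.txt"
    mode: "0644"
    content: |
      host: {{ inventory_hostname }}
      finished: {{ now(utc=true, fmt='%Y-%m-%dT%H:%M:%SZ') }}
      package changes:
      {{ apt_result.stdout_lines | default([]) | join('\n') }}
  delegate_to: localhost
  become: false
```

To cover RHEL-family hosts as well, you would register the dnf task and add its result to the content in the same way.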
Cadence beats heroics
The final ingredient is boring consistency: a monthly window per environment (non-production first, production a week later), driven by the same playbook every time. Teams that patch monthly with automation spend less total effort — and carry far less risk — than teams that patch quarterly in heroic manual marathons. Write the playbook once; let the calendar run it.