Patching Linux Fleets at Scale with Ansible

OS patching is the least glamorous job in operations and one of the most important — most real-world compromises exploit vulnerabilities that had fixes available for months. Done by hand, patching fifty servers is an error-prone weekend. Done with Ansible, it's one command, a controlled rollout, and a report. Here's a battle-tested pattern.

The requirements of safe patching

Batching — never patch everything at once; a bad kernel should hurt one batch, not the fleet.
Pre-checks — confirm disk space, snapshot/backup status, and service health before touching a host.
Controlled reboots — reboot only when required, and wait for the host to come back.
Post-checks — verify services after reboot; stop the rollout if a batch fails.
Evidence — a record of what was patched, when, for the change ticket.

The playbook skeleton

---
- name: Patch Linux fleet
  hosts: linux_patching
  become: true
  serial: "25%"                    # rolling batches: quarter of the group at a time
  max_fail_percentage: 0           # any failure halts the rollout

  pre_tasks:
    - name: Verify at least 2GB free on /
      ansible.builtin.assert:
        that: item.size_available > 2147483648
      loop: "{{ ansible_mounts | selectattr('mount','equalto','/') | list }}"

    - name: Drain node from load balancer   # site-specific: API call, NSG tag, etc.
      ansible.builtin.debug:
        msg: "hook: remove {{ inventory_hostname }} from LB pool"

  tasks:
    - name: Apply all security and bugfix updates (Debian/Ubuntu)
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true
        autoremove: true
      when: ansible_os_family == "Debian"
      register: apt_result

    - name: Apply updates (RHEL/Oracle Linux)
      ansible.builtin.dnf:
        name: "*"
        state: latest
      when: ansible_os_family == "RedHat"

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_flag
      when: ansible_os_family == "Debian"

    - name: Reboot if required
      ansible.builtin.reboot:
        reboot_timeout: 900
        post_reboot_delay: 30
      when: reboot_flag.stat.exists | default(false)

  post_tasks:
    - name: Verify critical services are running
      ansible.builtin.service_facts:

    - name: Assert nginx is active
      ansible.builtin.assert:
        that: ansible_facts.services['nginx.service'].state == 'running'
      when: "'web' in group_names"

    - name: Re-enable node in load balancer
      ansible.builtin.debug:
        msg: "hook: return {{ inventory_hostname }} to LB pool"

The three lines that make it production-grade

serial: "25%" turns a big-bang change into a rolling one. Ansible completes the whole play — patch, reboot, verify — for one batch before starting the next. For clustered systems (database replicas, HA pairs), use explicit ordered groups instead of a percentage so passive nodes always go first.

max_fail_percentage: 0 is your circuit breaker. If any host in a batch fails post-checks, the rollout stops with most of the fleet untouched — exactly the state you want to debug from.

The reboot module quietly does what used to need fragile scripting: issues the reboot, waits for SSH to return, and fails clearly if the host doesn't come back within the timeout.

Rollback thinking

Package managers can downgrade, but the honest rollback strategy for OS patching is restore from snapshot. On cloud platforms, take a boot-volume backup or snapshot as a pre-task (a policy-driven backup shortly before the window also works). Pair that with batching, and your worst case is restoring a handful of machines — not rebuilding a fleet.

Reporting for the change ticket

Register the update results and write a simple summary per host — {{ apt_result.stdout_lines | default([]) }} into a file on the control node, or a templated report. Attach it to the change request and your patching evidence writes itself every cycle.

Cadence beats heroics

The final ingredient is boring consistency: a monthly window per environment (non-production first, production a week later), driven by the same playbook every time. Teams that patch monthly with automation spend less total effort — and carry far less risk — than teams that patch quarterly in heroic manual marathons. Write the playbook once; let the calendar run it.

back to all posts