Automated Incident Response: Building AI Runbooks for Common Server Issues

The 3am Problem Nobody Wants to Solve Manually

Your monitoring system fires an alert. A production server is throwing 502 errors. Response times have spiked to eight seconds. Revenue is bleeding out at roughly $4,000 per minute.

Someone has to wake up, log in, work through a mental checklist, and figure out what's gone wrong - while half-asleep and under pressure. They'll probably fix it eventually. But "eventually" might be 45 minutes from now, and the post-incident review will reveal they missed two diagnostic steps that would have cut that time in half.

This is the operational reality for most Australian businesses running their own infrastructure. Incident response is still largely a human process, dependent on whoever is on call having the right knowledge, the right access, and the right presence of mind at the wrong hour.

Automated incident response AI changes this equation significantly. Not by replacing engineers - but by doing the diagnostic legwork instantly, executing known-good remediation steps without hesitation, and handing humans a situation that's either already resolved or already understood.

This article walks through how to build AI runbooks for the most common server issues, what the architecture actually looks like, and where the boundaries of automation should sit.


What an AI Runbook Actually Is

A traditional runbook is a document. It describes symptoms, asks you to check certain things, and tells you what to do based on what you find. Engineers follow them manually, and the quality of the outcome depends on how carefully the runbook was written and how carefully it's being read at 3am.

An AI runbook is the executable version of that document. It's a structured workflow - typically built around a large language model or a rules-based orchestration layer - that can:

  • Query your monitoring and logging systems directly
  • Interpret metrics and log patterns in context
  • Execute remediation commands against your infrastructure APIs
  • Escalate to humans with a pre-built summary of what it found and what it tried

The key distinction is that an AI runbook doesn't wait for a human to read it. When an alert fires, the runbook runs. By the time an engineer opens their phone, the system has already done the first five minutes of diagnostic work.

Practically, these runbooks are implemented using tools like AWS Systems Manager Automation, Azure Automation, or custom orchestration built on frameworks like LangChain or Temporal. The AI component - typically an LLM with access to your observability data - handles the interpretive layer: deciding which branch of the runbook applies given the current evidence.
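At its simplest, that orchestration layer is a dispatch table: alerts arrive at a webhook, and each alert type maps to a runbook function. A minimal Python sketch - every name here is illustrative, not from any specific framework:

```python
# Hypothetical runbook dispatch sketch. Alert types and runbook names
# are placeholders; map them to whatever your monitoring system emits.

def disk_space_runbook(alert: dict) -> str:
    # Placeholder: gather df/du output, classify, remediate or escalate.
    return "escalated"

def memory_pressure_runbook(alert: dict) -> str:
    return "escalated"

# Map alert types to runbooks.
RUNBOOKS = {
    "disk_space_high": disk_space_runbook,
    "memory_pressure": memory_pressure_runbook,
}

def handle_alert(alert: dict) -> str:
    """Entry point called by the webhook receiver when an alert fires."""
    runbook = RUNBOOKS.get(alert.get("type"))
    if runbook is None:
        return "escalated"  # unknown alert type: always hand to a human
    return runbook(alert)
```

The important property is the default: anything the dispatch table doesn't recognise goes straight to a human, never to a best-guess automation path.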


Mapping Your Most Common Incidents First

Before you build anything, you need to know what you're actually automating. Pull your incident history from the last 12 months and categorise by type, frequency, and mean time to resolution.

For most hosting environments, the high-frequency incidents cluster around a predictable set of problems:

  • Disk space exhaustion - log directories filling up, database transaction logs growing unchecked
  • Memory pressure and OOM kills - application processes consuming more than their allocation
  • Database connection pool saturation - too many connections queued, application threads blocked
  • Certificate expiry - TLS certificates lapsing because renewal automation failed silently
  • Runaway processes - a single job consuming 100% CPU and starving everything else

These five categories typically account for 60-70% of P2 and P3 incidents in a hosting environment. They're also well-understood problems with known diagnostic steps and known remediation paths. That makes them ideal candidates for your first AI runbooks.

Start with disk space. It's the lowest-risk automation target because the diagnostic is unambiguous and the remediation options are limited and safe: identify what's consuming space, rotate or archive logs if policy allows, alert a human if the cause is unexpected.


Building a Concrete Example: Disk Space Runbook

Here's what a disk space runbook looks like in practice, using AWS Systems Manager Automation as the orchestration layer with an LLM integration for log analysis.

Trigger: CloudWatch alarm fires when any volume on a tagged production instance exceeds 85% utilisation.

Step 1 - Gather context. The runbook queries the instance via SSM Run Command, executing df -h and du -sh /var/log/* /tmp/* /home/* to identify the top consumers. Output is captured and passed to the next step.
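Before the output is handed to the LLM, it helps to parse it into a ranked list so the prompt stays compact. A minimal sketch of that parsing step, assuming GNU du -h column output (size, then path):

```python
def parse_human_size(size: str) -> int:
    """Convert a `du -h` size like '1.2G' to approximate bytes."""
    units = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}
    if size[-1] in units:
        return int(float(size[:-1]) * units[size[-1]])
    return int(float(size))

def top_consumers(du_output: str, n: int = 3) -> list[tuple[str, int]]:
    """Rank directories from `du -sh` output, largest first."""
    entries = []
    for line in du_output.strip().splitlines():
        size, path = line.split(None, 1)  # size, whitespace, path
        entries.append((path.strip(), parse_human_size(size)))
    return sorted(entries, key=lambda e: e[1], reverse=True)[:n]
```

Feeding the LLM a ranked top-three rather than raw command output keeps the prompt short and makes the classification step more deterministic.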

Step 2 - Analyse with LLM. The disk usage output is sent to an LLM (GPT-4o or Claude 3.5 Sonnet work well here) with a structured prompt that asks it to: identify the primary space consumer, determine whether it matches known patterns (log rotation failure, specific application log growth, database growth), and recommend a remediation action from an approved list.

The approved list is important. You're not asking the LLM to invent solutions - you're asking it to classify the problem and select from pre-approved remediation scripts. This keeps the blast radius of any LLM error small.

Step 3 - Execute approved remediation. If the LLM classifies the issue as "log rotation failure" with confidence above a defined threshold, the runbook executes your standard log rotation script and rechecks disk usage. If classification confidence is low, or if the primary consumer is something unexpected (like a database data directory), it skips remediation and goes straight to escalation.

Step 4 - Escalate with context. A PagerDuty or Opsgenie alert fires with a structured summary: current disk usage by directory, LLM classification, whether remediation was attempted, and the current state after any remediation. The on-call engineer gets a situation report, not a raw alarm.
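The structured summary is just a payload your alerting tool already understands. A sketch shaped like a PagerDuty Events API v2 event (field names follow that API; the severity logic is an assumption you'd tune to your own policy):

```python
def build_escalation_event(routing_key: str, instance_id: str,
                           classification: str, confidence: float,
                           remediation_attempted: bool,
                           disk_usage_pct: int) -> dict:
    """Build a PagerDuty Events v2 payload carrying the runbook's findings."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": (f"Disk at {disk_usage_pct}% on {instance_id} "
                        f"- classified as {classification}"),
            "source": instance_id,
            # Downgrade severity if remediation already ran: assumption,
            # adjust to your own escalation policy.
            "severity": "warning" if remediation_attempted else "critical",
            "custom_details": {
                "llm_classification": classification,
                "llm_confidence": confidence,
                "remediation_attempted": remediation_attempted,
                "disk_usage_after_pct": disk_usage_pct,
            },
        },
    }
```

The engineer opening the page sees the classification, what was tried, and the current state - the "situation report" rather than the raw alarm.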

This runbook resolves log rotation failures - a common cause of disk exhaustion - without human involvement. For everything else, it hands over a much clearer picture than a raw CloudWatch alarm would provide.


Handling the Trickier Cases: Memory and Database Issues

Memory pressure incidents are more complex because the causes vary more. A memory leak in application code behaves differently from a cache that's grown beyond its configured limit, which behaves differently from a legitimate traffic spike that's simply outgrown current capacity.

Your automated incident response AI runbook for memory pressure should follow a similar pattern but with more conservative remediation boundaries.

Diagnostic steps: Capture free -m, ps aux --sort=-%mem | head -20, and application-specific memory metrics (JVM heap if it's a Java application, Node.js process memory if it's Node). Check whether memory growth is recent and sudden or gradual over hours.

Remediation boundaries: Restarting an application process is a valid automated remediation for a process showing signs of a memory leak (gradual growth over hours, no corresponding traffic increase). Restarting a database process is not - that's always a human decision. Increasing swap is a temporary measure that might buy time but shouldn't be applied automatically without understanding the cause.
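Those boundaries are worth encoding explicitly rather than leaving implicit in prompt wording. A sketch of the decision function, under the assumptions stated above (gradual growth without a traffic increase is treated as a leak signature):

```python
def memory_remediation(process_kind: str, growth_pattern: str,
                       traffic_increased: bool) -> str:
    """Decide between automated restart and escalation.

    growth_pattern: "gradual" (over hours) or "sudden" (over minutes).
    """
    if process_kind == "database":
        return "escalate"  # database restarts are always a human decision
    if growth_pattern == "gradual" and not traffic_increased:
        return "restart_process"  # classic memory-leak signature
    return "escalate"  # sudden growth or traffic-driven: needs a human
```

Keeping this as plain code means the boundary is auditable and testable, independently of whatever the LLM concludes.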

For database connection pool saturation, the diagnostic is usually clearer: query your database for active connections, identify whether they're blocked on locks or simply accumulated beyond the pool limit, and check application logs for connection timeout errors. Remediation options include killing idle connections that have been open beyond a threshold, which is generally safe to automate, and adjusting pool configuration, which usually requires a human.
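For PostgreSQL, the idle-connection cleanup is a single query against pg_stat_activity. A sketch - the 10-minute threshold is an assumption, and the pure helper alongside it shows the selection logic in testable form:

```python
from datetime import datetime, timedelta

# PostgreSQL: terminate connections idle longer than the threshold.
# (Other databases differ; MySQL uses SHOW PROCESSLIST / KILL.)
KILL_IDLE_SQL = """
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes'
  AND pid <> pg_backend_pid();
"""

def idle_too_long(rows: list[tuple], now: datetime,
                  threshold: timedelta = timedelta(minutes=10)) -> list[int]:
    """Pick pids safe to terminate from (pid, state, state_change) rows."""
    return [pid for pid, state, changed in rows
            if state == "idle" and now - changed > threshold]
```

Note the pg_backend_pid() guard: the cleanup query must never terminate its own connection.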

The principle across all of these: automate the diagnosis completely, automate remediation conservatively, and always escalate with context.


Integrating AI Runbooks With Your Existing Observability Stack

AI runbooks don't replace your monitoring stack - they sit on top of it. The integration points matter.

Most Australian organisations running production infrastructure are working with some combination of Datadog, New Relic, Grafana, CloudWatch, or Prometheus. All of these expose APIs that your runbook orchestration layer can query. The practical architecture looks like this:

  • Alert routing - your monitoring system fires alerts to a webhook endpoint or directly to your runbook orchestration layer (Systems Manager, a Lambda function, a custom service)
  • Data collection - the runbook queries your observability APIs and your infrastructure APIs to gather current state
  • LLM analysis - structured data is passed to an LLM with a specific prompt template designed for that incident type
  • Action execution - approved remediation actions are executed via infrastructure APIs (AWS SSM, Azure Run Command, Kubernetes API, your hosting provider's API)
  • Escalation - results are written back to your alerting system with structured context

One practical consideration: LLM API latency adds time to your runbook execution. For most incident types this is acceptable - 2-3 seconds for an LLM call is negligible when the alternative is waking someone up. But for runbooks where speed is critical, consider whether the LLM analysis step can run in parallel with early remediation steps rather than sequentially.
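The parallel pattern is straightforward with asyncio: overlap the LLM call with safe, read-only diagnostic work you'd run regardless of its verdict. A sketch with stand-in coroutines:

```python
import asyncio

async def llm_analysis(context: dict) -> str:
    # Stand-in for the real LLM API call (typically 2-3 s of latency).
    await asyncio.sleep(0.01)
    return "log_rotation_failure"

async def gather_extra_diagnostics(context: dict) -> dict:
    # Safe, read-only steps that don't depend on the LLM's verdict.
    await asyncio.sleep(0.01)
    return {"inode_usage_pct": 41}

async def run_parallel(context: dict) -> tuple[str, dict]:
    # Run both concurrently so the LLM's latency overlaps with
    # diagnostic work you would have run anyway.
    return tuple(await asyncio.gather(
        llm_analysis(context),
        gather_extra_diagnostics(context),
    ))
```

Only remediation - anything that changes state - should wait for the classification; diagnostics never need to.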


Where Automation Ends and Humans Begin

The most important design decision in any automated incident response AI system is defining the escalation boundaries clearly. Automation should never be a reason for an engineer to feel like they've lost control of their infrastructure.

Hard rules worth establishing before you build:

  • Never automate remediation for data-layer issues without explicit human approval. Database restarts, schema changes, backup restoration - these always require a human in the loop.
  • Set a maximum automated action count per incident. If a runbook has tried three remediation steps and the problem persists, it should stop and escalate. Automated systems can make things worse if they keep trying.
  • Require human confirmation for any action that affects more than one system. A runbook can restart a single application process. It shouldn't be able to trigger a rolling restart across your entire fleet without a human approving that scope.
  • Log everything the runbook does. Every action taken by an automated system should be auditable. Your post-incident review should be able to reconstruct exactly what the runbook tried, in what order, and what the outcome was.
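Three of those four rules can live in one small guard object that every runbook consults before acting. A sketch, with the limits as illustrative defaults:

```python
class IncidentGuard:
    """Enforces hard limits on what a runbook may do in one incident."""

    MAX_ACTIONS = 3  # stop and escalate after this many remediation attempts

    def __init__(self) -> None:
        self.actions_taken = 0
        self.audit_log: list[tuple[str, str]] = []  # (action, decision)

    def allow(self, action: str, target_systems: list[str]) -> bool:
        if self.actions_taken >= self.MAX_ACTIONS:
            self.audit_log.append((action, "denied: action limit reached"))
            return False
        if len(target_systems) > 1:
            # Fleet-wide scope always requires human approval.
            self.audit_log.append((action, "denied: multi-system scope"))
            return False
        self.actions_taken += 1
        self.audit_log.append((action, "allowed"))
        return True
```

Every decision - allowed or denied - lands in the audit log, which gives the post-incident review the exact reconstruction the fourth rule demands.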

The goal is an on-call experience where engineers are woken up for genuinely complex problems - not for the routine issues that a well-built runbook can handle in under two minutes.


What to Do Next

If you're starting from scratch, the practical path forward is straightforward:

  1. Export your incident history from the last 12 months and categorise by type and frequency. Find your top five incident categories.
  2. Write the manual runbook first. If you can't describe the diagnostic and remediation steps clearly in a document, you can't automate them reliably.
  3. Build the disk space runbook as your first automation. It's low-risk, high-frequency, and gives you a working template for everything that follows.
  4. Run the automation in observation-only mode initially. Let it execute the diagnostic steps and generate recommendations, but require human approval for any remediation. Review the recommendations for two weeks before enabling automated execution.
  5. Measure the impact. Track mean time to resolution before and after. Track the percentage of incidents that are resolved without human intervention. Use those numbers to justify expanding the scope of automation.

If your team is spending significant engineering time on repetitive incidents that follow predictable patterns, automated incident response AI is worth serious investment. The technology is mature enough to be reliable, and the operational gains are measurable.

Exponential Tech works with Australian organisations to design and implement AI automation for infrastructure operations. If you'd like to talk through what this looks like for your specific environment, get in touch with our team.
