AI Failure Analysis

When a deployment fails, Galleon reads the logs, identifies the root cause, and — when the fix is in your infrastructure or configuration — opens a pull request with the resolution. Analysis runs automatically within seconds of a failure.

What Galleon fixes vs. what Galleon diagnoses

There's an important boundary worth stating upfront:

Galleon writes fix PRs for infrastructure and configuration issues. Wrong container ports, missing IAM permissions, misconfigured environment variables, undersized memory limits, broken health check paths — anything that lives in your Terraform, your Dockerfile, your workflow, or your AWS resource configuration.

Galleon diagnoses but does not modify your application code. If your container crashes because of an unhandled exception, a missing dependency in your requirements.txt, or a syntax error in your handler, Galleon identifies the issue and surfaces the specific log lines and suggested fix in the deployment UI — but the code change is yours to make. We don't write your business logic.

This boundary exists deliberately. Auto-generated infrastructure fixes are reviewable, scoped, and safe to merge. Auto-generated application code is none of those things.

How analysis works

Galleon uses a two-stage approach: fast pattern matching first, AI analysis only when needed.

Stage 1: Pattern matching (zero-cost, sub-second). Most deployment failures fall into a small set of recurring patterns. Galleon checks the logs against a library of known failure signatures and returns a diagnosis immediately when there's a match. No LLM call, no cost, no waiting.

Stage 2: AI analysis (Claude, when patterns don't match). For ambiguous or novel failures, Galleon sends the logs and relevant repository context to Claude for deeper analysis. The model identifies the root cause, explains it in plain language, and proposes a specific fix.

The pattern library handles the bulk of common failures — port mismatches, IAM permission errors, OOM kills, missing env vars — because those failures are repetitive across customers. AI analysis handles the long tail, where the failure is specific to your application's setup.
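
The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not Galleon's implementation: the `SIGNATURES` list, the `Diagnosis` type, and the `ai_fallback` callable are hypothetical names, and the two regular expressions are only examples of what a failure signature might look like.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Diagnosis:
    cause: str
    source: str  # "pattern" or "ai"


# Hypothetical signature library; a real one would hold many more entries.
SIGNATURES = [
    (re.compile(r"OutOfMemoryError|exit code 137", re.I), "Container ran out of memory"),
    (re.compile(r"AccessDenied|is not authorized to perform", re.I), "IAM permission error"),
]


def match_pattern(logs: str) -> Optional[Diagnosis]:
    """Stage 1: return the first known signature found in the logs, if any."""
    for regex, cause in SIGNATURES:
        if regex.search(logs):
            return Diagnosis(cause, "pattern")
    return None


def analyze(logs: str, ai_fallback: Callable[[str], str]) -> Diagnosis:
    """Try pattern matching first; call the model only when nothing matches."""
    hit = match_pattern(logs)
    if hit is not None:
        return hit
    # Stage 2: ai_fallback stands in for the LLM call (an assumption here).
    return Diagnosis(ai_fallback(logs), "ai")
```

Keeping the fallback as an injected callable means the pattern stage can be exercised without any model call.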

Container failures (ECS Fargate)

When a container deployment fails, Galleon fetches CloudWatch logs from the failed task and matches against known failure patterns.
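
To locate the logs for one failed task, it helps to know how ECS names log streams. The sketch below assumes the task uses the `awslogs` log driver with `awslogs-stream-prefix` set, in which case the stream name follows the form prefix/container-name/task-id. The function names are illustrative and do not describe Galleon's internals.

```python
def task_id_from_arn(task_arn: str) -> str:
    """Return the task ID, which is the last path segment of an ECS task ARN."""
    return task_arn.rsplit("/", 1)[-1]


def awslogs_stream_name(stream_prefix: str, container_name: str, task_arn: str) -> str:
    """Build the CloudWatch log stream name the awslogs driver uses for a task."""
    return f"{stream_prefix}/{container_name}/{task_id_from_arn(task_arn)}"
```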

Patterns that produce fix PRs:

  • Container port mismatch — Port exposed in the Dockerfile or task definition doesn't match the load balancer target group. Fix PR adjusts the Terraform configuration.
  • Health check path returning non-200 — The container starts but the configured health check endpoint isn't responding correctly. Fix PR updates the health check path or grace period.
  • IAM permission errors — The task execution role can't pull from ECR or fetch secrets from Secrets Manager, or the task role can't perform another action the application requires. Fix PR adds the missing permission to the affected role's policy.
  • Memory or CPU exhaustion — The container is being killed by ECS due to resource limits. Fix PR steps up the task definition's memory or CPU allocation.
  • Missing environment variables — A variable Galleon expected isn't set. Fix PR adds the variable with a placeholder for you to fill in.
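
As an illustration of the first pattern in the list above, a port mismatch can be detected by comparing the ports a Dockerfile exposes with the target group port. This is a simplified sketch under stated assumptions: the helper names are hypothetical, and it ignores port ranges and ports supplied through build arguments.

```python
import re
from typing import Set


def exposed_ports(dockerfile_text: str) -> Set[int]:
    """Collect numeric ports from EXPOSE lines, dropping any /tcp or /udp suffix."""
    ports: Set[int] = set()
    for line in dockerfile_text.splitlines():
        match = re.match(r"\s*EXPOSE\s+(.+)", line, re.I)
        if not match:
            continue
        for token in match.group(1).split():
            number = token.split("/", 1)[0]
            if number.isdigit():  # skips ranges and $VARIABLES
                ports.add(int(number))
    return ports


def port_mismatch(dockerfile_text: str, target_group_port: int) -> bool:
    """True when the Dockerfile exposes ports but none is the target group port."""
    ports = exposed_ports(dockerfile_text)
    return bool(ports) and target_group_port not in ports
```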

Patterns that produce diagnosis only:

  • Missing or invalid Dockerfile — Galleon flags the issue and suggests how to structure a Dockerfile for your framework, but doesn't write one for you.
  • Application crash on startup — Syntax errors, missing imports, broken entrypoints. Galleon identifies the offending log lines and the likely cause; you make the fix.
  • Dependency errors — Missing package, version conflict, or import failure. Galleon identifies the missing or conflicting dependency; you update your requirements file.
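
For a Python service, a dependency diagnosis like the last item above often starts from the interpreter's own error message. The sketch below pulls the missing top-level module out of a `ModuleNotFoundError` line. The function name is hypothetical, and the import name does not always equal the package name you would add to your requirements file.

```python
import re
from typing import Optional

# Message format emitted by CPython when an import cannot be resolved.
MISSING_MODULE = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def missing_module(logs: str) -> Optional[str]:
    """Return the top-level module named in the first ModuleNotFoundError, if any."""
    match = MISSING_MODULE.search(logs)
    if match is None:
        return None
    # "requests.adapters" -> "requests"
    return match.group(1).split(".")[0]
```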

Serverless failures (Lambda)

Lambda failure analysis launched in v1.7 (March 2026) and currently has fewer pattern-matched cases than container analysis. Coverage is expanding.

For serverless Next.js deployments, Galleon checks the GitHub Actions build logs first, then resolves CloudWatch log groups for the four Lambda functions (server, image optimizer, revalidation, warmer).
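
Resolving the log groups can be pictured as follows, assuming each function writes to Lambda's default log group, which is named `/aws/lambda/` followed by the function name. The function names in the usage example are made up, and functions configured with a custom log group would need a lookup instead.

```python
from typing import Dict


def default_log_groups(function_names: Dict[str, str]) -> Dict[str, str]:
    """Map each role (server, warmer, ...) to its function's default log group."""
    return {role: f"/aws/lambda/{name}" for role, name in function_names.items()}


# Hypothetical function names for the four roles listed above.
groups = default_log_groups({
    "server": "myapp-server",
    "image-optimizer": "myapp-image-optimizer",
    "revalidation": "myapp-revalidation",
    "warmer": "myapp-warmer",
})
```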

Patterns that produce fix PRs:

  • Out of memory — Lambda hit its memory limit. Fix PR steps up the memory allocation.
  • Timeout — Lambda exceeded its execution time limit. Fix PR increases the timeout.
  • IAM permission errors — The Lambda execution role is missing a required permission. Fix PR adds it.
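
A "step up" for memory or timeout has to respect Lambda's own limits: memory can be set up to 10,240 MB and the timeout up to 900 seconds. The sketch below doubles the current value and clamps it to the ceiling. The doubling factor is an assumption made for illustration, since the step size Galleon uses is not documented here.

```python
LAMBDA_MAX_MEMORY_MB = 10_240  # AWS Lambda memory ceiling
LAMBDA_MAX_TIMEOUT_S = 900     # AWS Lambda timeout ceiling (15 minutes)


def step_up(current: int, ceiling: int, factor: float = 2.0) -> int:
    """Scale a setting by `factor`, never exceeding `ceiling`."""
    proposed = max(current + 1, int(current * factor))
    return min(ceiling, proposed)


new_memory = step_up(512, LAMBDA_MAX_MEMORY_MB)
new_timeout = step_up(600, LAMBDA_MAX_TIMEOUT_S)
```

When a function is already at the ceiling, the value comes back unchanged, which is a signal that raising the limit alone cannot resolve the failure.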

Patterns that produce diagnosis only:

  • Bundle size warnings — Galleon flags oversized bundles that affect cold-start latency, but bundle reduction requires application-level changes.
  • Build failures — Missing dependencies, broken imports, OpenNext compatibility issues. Galleon surfaces the diagnosis; you fix the code.

Fix PRs

When pattern matching or AI analysis identifies a fixable issue, Galleon opens a pull request on your repository containing:

  • The proposed change — Terraform diff, Dockerfile edit, or workflow update that addresses the root cause.
  • The diagnosis — A summary of what failed, with the relevant log lines.
  • The reasoning — Why the fix resolves the issue, and what the change does.

You review and merge the PR like any other code change. When you merge, your existing GitHub Actions workflow redeploys with the fix applied — no separate trigger required.
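
One way to picture the diagnosis and reasoning parts of a fix PR is as a small record rendered into the PR description, with the proposed change carried by the diff itself. The field names and layout below are hypothetical and are not Galleon's actual PR template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FixProposal:
    diagnosis: str        # summary of what failed
    log_lines: List[str]  # the relevant log excerpts
    reasoning: str        # why the change resolves the failure


def render_pr_body(fix: FixProposal) -> str:
    """Render the diagnosis, indented log lines, and reasoning as a PR description."""
    logs = "\n".join(f"    {line}" for line in fix.log_lines)
    return (
        f"Diagnosis\n{fix.diagnosis}\n\n"
        f"Relevant log lines\n{logs}\n\n"
        f"Reasoning\n{fix.reasoning}\n"
    )
```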

Deduplication

Galleon avoids opening duplicate fix PRs for the same recurring issue. If a deployment fails with an error pattern that Galleon has already analyzed and opened a fix PR for within the last 60 minutes, Galleon skips re-analysis and points you to the existing PR.

This is a deduplication safeguard, not a rate limit on overall analysis. New failure patterns are analyzed immediately, even if a previous failure is still inside its 60-minute deduplication window.
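
The behavior described above amounts to a per-pattern time window. The following sketch keys the window on an error-pattern identifier and uses the 60-minute figure from this section. The class, its method names, and the in-memory dictionary are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

DEDUP_WINDOW = timedelta(minutes=60)


class FixPrDeduplicator:
    def __init__(self) -> None:
        # pattern key -> (time the PR was opened, PR URL)
        self._recent: Dict[str, Tuple[datetime, str]] = {}

    def record(self, pattern_key: str, pr_url: str, now: datetime) -> None:
        """Remember that a fix PR was opened for this pattern."""
        self._recent[pattern_key] = (now, pr_url)

    def existing_pr(self, pattern_key: str, now: datetime) -> Optional[str]:
        """Return the earlier PR if this pattern is still inside the window."""
        entry = self._recent.get(pattern_key)
        if entry is not None and now - entry[0] < DEDUP_WINDOW:
            return entry[1]
        return None
```

Because the lookup is keyed by pattern, a different failure is never blocked by an earlier one, which matches the distinction between deduplication and rate limiting.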

Viewing analysis results

Analysis appears in the deployment detail view on the Runs page. Each analysis shows:

  • Root cause — A short summary of what failed.
  • Suggested fix — The recommended change, with reasoning.
  • Confidence — A percentage indicating how certain the diagnosis is. Above 80% means the pattern was an exact match or the AI analysis had high-quality signal. Below 60% means the diagnosis is plausible but worth treating as a starting point rather than a definitive answer.
  • Fix PR — A direct link to the auto-created pull request, when one was generated.
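
If you consume the confidence value programmatically, the thresholds above translate into a small helper. This sketch uses only the two cut-offs stated in this section. The 60% to 80% range is not described here, so it is labeled "unspecified" rather than guessed, and the band names are illustrative.

```python
def confidence_band(percent: float) -> str:
    """Classify a confidence percentage using the documented 80% and 60% thresholds."""
    if not 0 <= percent <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if percent > 80:
        return "high"            # exact pattern match or strong AI signal
    if percent < 60:
        return "starting point"  # plausible, but verify before acting
    return "unspecified"         # 60-80: not defined in this section
```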

Next steps