n8n at Production Scale: Workflows That Don't Break at 3 AM

n8n is amazing for the first 100 workflows. After that, the things that worked at small scale — single-instance deployment, manual error handling, untracked credentials — start eating your team's nights. This is the n8n production playbook we use with clients running thousands of workflow executions per hour.

[Figure: workflow orchestration tier — the system serving a B2B operations team's 800 active n8n flows.]

n8n is one of the best things to happen to engineering teams in years. It collapses a category of integration work that used to require a backend service, a queue, a scheduler, and a maintenance budget into a workflow you can draft over lunch. It is genuinely transformative — until you cross the threshold from "n8n as a tool" to "n8n as critical infrastructure."

The transition is not gradual. One morning you wake up, look at the dashboard, and realise the business now runs on 400 workflows that talk to Stripe, HubSpot, Postgres, and three internal services. A failing workflow is a customer-facing outage. A leaked credential is a compliance event. And you're still on a single Docker container with a sticky note in the README that says "do not restart between 9 AM and 6 PM."

This is the playbook we use to take that environment from fragile to genuinely production-grade.

Queue Mode Is Not Optional

The default n8n deployment runs every workflow execution inside the same process that serves the editor UI. For development and the first few dozen workflows this is fine. For real production traffic it's a bottleneck and a single point of failure: long-running workflows starve the UI, a single bad node crashes the entire instance, and you can't scale horizontally.

The fix is queue mode, where n8n splits into three concerns:

  • Main process — serves the editor UI and the webhook receivers, pushes jobs onto Redis.
  • Worker processes — pull jobs from Redis and execute workflows. Horizontally scalable.
  • Redis — the message bus and execution state store.

A baseline docker-compose for this looks like the following. Note the explicit EXECUTIONS_MODE, the shared database, and the worker concurrency settings.

version: "3.8"

x-n8n-env: &n8n-env
  DB_TYPE: postgresdb
  DB_POSTGRESDB_HOST: postgres
  DB_POSTGRESDB_DATABASE: n8n
  DB_POSTGRESDB_USER: n8n
  DB_POSTGRESDB_PASSWORD: ${POSTGRES_PASSWORD}
  EXECUTIONS_MODE: queue
  QUEUE_BULL_REDIS_HOST: redis
  QUEUE_BULL_REDIS_PORT: 6379
  N8N_ENCRYPTION_KEY: ${N8N_ENCRYPTION_KEY}
  N8N_HOST: n8n.internal.example.com
  N8N_PROTOCOL: https
  WEBHOOK_URL: https://n8n.internal.example.com/

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: n8n
      POSTGRES_USER: n8n
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis_data:/data

  n8n-main:
    image: n8nio/n8n:latest
    environment:
      <<: *n8n-env
    ports:
      - "5678:5678"
    depends_on: [postgres, redis]

  n8n-worker:
    image: n8nio/n8n:latest
    command: ["worker", "--concurrency=10"]
    environment:
      <<: *n8n-env
    deploy:
      replicas: 4
    depends_on: [postgres, redis]

volumes:
  postgres_data:
  redis_data:

On Kubernetes the model is identical: one Deployment for the main pod, one Deployment for workers with an HPA scaling on Redis queue depth, and a managed Redis (ElastiCache, Memorystore, Upstash) for state. A reasonable starting point: 1 main pod, 4 worker pods at 10 concurrency each, scaling to 12 workers under load.

Error Workflows: The Pattern That Saves Your Nights

Out of the box, a failed n8n workflow logs the error and goes silent. The user who triggered it gets nothing. The on-call engineer finds out from the customer who didn't get their invoice.

The pattern we use everywhere: a single shared error workflow that every other workflow points at via the Settings → Error Workflow dropdown. When any workflow fails, n8n triggers the error workflow with the full error payload — workflow name, node that failed, error message, input data, execution ID.

The error workflow then:

  1. Posts a structured alert to a dedicated Slack channel with a deep link to the failed execution.
  2. For high-severity workflows (tagged via metadata), pages on-call via PagerDuty.
  3. Writes the error to a Postgres workflow_errors table so we can trend on failure rate per workflow.
  4. For idempotent workflows, schedules a single retry 5 minutes later.

The retry logic inside a Function node looks like this:

// In the shared error workflow's Function node
const { workflow, execution } = $input.first().json;

// Only auto-retry workflows explicitly tagged as idempotent
const tags = workflow.tags || [];
const isIdempotent = tags.some(t => t.name === 'idempotent');
const previousRetries = execution.retryOf ? 1 : 0;

if (!isIdempotent || previousRetries >= 1) {
  return [{ json: { action: 'alert_only', reason: 'not_retryable' } }];
}

return [{
  json: {
    action: 'retry',
    workflowId: workflow.id,
    executionId: execution.id,
    delay_ms: 5 * 60 * 1000,
  }
}];

This one pattern eliminates roughly 70% of the "did the workflow run?" tickets that flood into ops channels in untreated n8n environments.

Credentials: Out of the Database, Into a Secret Store

By default, n8n encrypts credentials in its own database with a single N8N_ENCRYPTION_KEY. That works, but it concentrates every production secret in one place, protected by one key; a leaked database dump plus a leaked key makes all of them readable. For anything beyond an internal sandbox we externalise secret storage entirely.

Our pattern: credentials live in HashiCorp Vault (or AWS Secrets Manager, GCP Secret Manager — pick your cloud's flavour). n8n's External Secrets feature (Enterprise tier) pulls them at workflow execution time using a short-lived token. No production secret is ever stored at rest inside n8n's database, and rotation happens at the Vault layer without redeploying workflows.

The encryption key itself is mounted from the cloud's KMS, not a static environment variable in a compose file. N8N_ENCRYPTION_KEY rotation is treated like any other secret rotation — quarterly, with a documented playbook.
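
A minimal Kubernetes sketch of that last point. The Secret named n8n-encryption-key below is illustrative; in practice it would itself be synced from Vault or your cloud's KMS by something like the External Secrets Operator, so no key material ever lives in a manifest or the repo:

# Worker Deployment: the encryption key is injected from a Kubernetes
# Secret (itself synced from Vault/KMS by an operator), never hard-coded
# in a manifest or compose file. Secret name and key are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n-worker
spec:
  replicas: 4
  selector:
    matchLabels:
      app: n8n-worker
  template:
    metadata:
      labels:
        app: n8n-worker
    spec:
      containers:
        - name: n8n-worker
          image: n8nio/n8n:latest   # pin a specific version in production
          args: ["worker", "--concurrency=10"]
          env:
            - name: N8N_ENCRYPTION_KEY
              valueFrom:
                secretKeyRef:
                  name: n8n-encryption-key
                  key: encryptionKey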

Version Control: Workflows as Code

The single most common production failure mode in n8n is a workflow that worked yesterday and doesn't today, because someone clicked something in the editor and saved it. Without version control you have no way to diff, review, or roll back.

n8n's Source Control feature (Enterprise tier) and the CLI export and import commands (n8n export:workflow, n8n import:workflow) both let you sync workflows to a Git repo as JSON files. The pattern we enforce:

  • Production n8n instance has a one-way sync from main branch — workflows in Git are the source of truth.
  • Development happens in a separate n8n instance pointed at a feature branch.
  • Changes go through PR review with the workflow JSON diff visible to reviewers.
  • Merging to main triggers a CI job that imports the workflows into production (a sketch of that job follows below).

This sounds heavy, and it is wildly worth it. The first time you can git revert a workflow change at 2 AM instead of trying to remember what node configuration you had three days ago, you'll never go back.
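
One way to wire that CI import job, as a minimal sketch. It assumes the n8n public REST API is enabled on the production instance, an API key is stored as a repository secret, each workflow JSON file carries its production workflow id, and the exported JSON has been trimmed to the fields the API accepts (raw exports may carry read-only fields the API rejects). The paths and secret names are illustrative, not a standard:

# .github/workflows/deploy-n8n.yml — sketch of the one-way sync from main.
# Pushes every workflow file to the production instance via the public API.
name: Deploy n8n workflows
on:
  push:
    branches: [main]
    paths: ["workflows/*.json"]

jobs:
  import:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Push workflows to production
        env:
          N8N_URL: https://n8n.internal.example.com
          N8N_API_KEY: ${{ secrets.N8N_API_KEY }}
        run: |
          # Each file's "id" field must match the workflow id in production.
          for f in workflows/*.json; do
            id=$(jq -r '.id' "$f")
            curl --fail -sS -X PUT "$N8N_URL/api/v1/workflows/$id" \
              -H "X-N8N-API-KEY: $N8N_API_KEY" \
              -H "Content-Type: application/json" \
              --data @"$f"
          done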

Monitoring: Prometheus, Grafana, and Alerts That Mean Something

n8n exposes a /metrics endpoint when N8N_METRICS=true is set. The metrics that actually matter for production:

  • n8n_workflow_executions_total — labelled by workflow and status (success/failure/error).
  • n8n_workflow_execution_duration_seconds — histogram, labelled by workflow.
  • n8n_queue_length — depth of the Redis job queue. The leading indicator of "we need more workers."
  • n8n_active_workflow_count — total workflows currently executing.

From these you build four alerts that catch nearly everything that goes wrong in practice:

  1. Per-workflow failure rate — alert if any workflow's 1-hour failure rate exceeds 5% (configurable per workflow's criticality).
  2. Queue depth — warn at 100 pending jobs, page at 500. This catches both worker outages and runaway producer workflows.
  3. Execution duration p95 — alert on any workflow whose p95 duration doubles over a 24-hour window. Usually a sign that an upstream API has degraded.
  4. Worker availability — alert if fewer than N worker pods are Ready for more than 2 minutes.
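
A sketch of the first two as Prometheus alerting rules, using the metric and label names listed above. Verify them against your instance's actual /metrics output before relying on them; exact names and labels vary across n8n versions and metrics settings:

# n8n-alerts.rules.yml — sketch; metric and label names assumed from above.
groups:
  - name: n8n
    rules:
      # Alert 1: per-workflow failure rate above 5% over the last hour.
      - alert: N8nWorkflowFailureRateHigh
        expr: |
          sum by (workflow) (rate(n8n_workflow_executions_total{status="failure"}[1h]))
            /
          sum by (workflow) (rate(n8n_workflow_executions_total[1h]))
            > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "n8n workflow {{ $labels.workflow }} failing above 5% over the last hour"

      # Alert 2: queue depth at the paging threshold (500 pending jobs).
      - alert: N8nQueueDepthCritical
        expr: n8n_queue_length > 500
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "n8n Redis queue depth above 500: workers down or a runaway producer"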

Grafana dashboards we keep on the ops wall: per-workflow execution heatmap, queue depth time series, failure rate top-10 leaderboard, and a "longest currently-running execution" gauge for catching infinite loops.

Scaling Workers Based on Real Load

Static worker counts waste money during quiet periods and queue up during bursts. On Kubernetes we autoscale workers using KEDA with a Redis trigger, which lets the queue depth itself drive scaling:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: n8n-worker-scaler
spec:
  scaleTargetRef:
    name: n8n-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
    - type: redis
      metadata:
        address: redis.n8n.svc.cluster.local:6379
        listName: bull:jobs:wait
        listLength: "25"

This keeps a 2-pod floor (so we can absorb sudden bursts without cold-start), scales up aggressively when queue depth exceeds 25 jobs per replica, and scales back down after 2 minutes of quiet. Combined with spot instances for the worker node pool, our largest client runs roughly 12,000 daily workflow executions on a double-digit monthly compute bill.

Key Takeaways

  • Queue mode is the line between "tool" and "infrastructure." Cross it before you cross 100 workflows.
  • A single shared error workflow eliminates the silent-failure category entirely.
  • Credentials belong in Vault or your cloud's secret manager — not in n8n's database.
  • Version control workflows as JSON in Git. Production syncs one-way from main.
  • The four alerts that matter: per-workflow failure rate, queue depth, execution duration, worker availability.
  • Autoscale workers on Redis queue depth with KEDA. Static replica counts waste money or cause outages.