NodeBridge #4: The Prod Checklist, Part 1 (16 Items for Error-Free Workflows)

Issue 4 • January 6, 2026

Runbooks and checklists are your shield against those 3am pings, Slacks, and texts

⏱️ 12 min read

In This Issue

The Checklist (Part 1 of 2)
1.1 Error Notifications
1.2 Try/Catch Blocks on Critical Operations
1.3 Graceful Degradation
1.4 Retry Logic with Exponential Backoff
1.5 Dead Letter Queue (Failed Items Storage)
1.6 Circuit Breaker Pattern
2.1 Input Validation on Webhooks
2.2 Sanitize User Input
2.3 Null/Undefined Checks
2.4 Data Type Validation
2.5 Length and Size Limits
3.1 Execution Logging
3.2 Performance Metrics
3.3 Health Check Endpoint
3.4 Alert on Unusual Patterns
3.5 Weekly Summary Reports
Part 1 Checklist Summary

You built a workflow. It's solid green. You deploy it. Stakeholders begin depending on it.

Two weeks later, it breaks at 3am. No error notification. You find out when the marketing manager messages you at 7am asking why the sales report didn't arrive. You scramble to fix it, but you don't know what broke because there are no logs.

You fix the immediate issue. Three days later, the workflow breaks again. Different error. This time it's a rate limit you didn't know existed. Your workflow made 1,000 API calls in 10 minutes. Your account is temporarily banned.

You fix that. A week later, a user sends malformed data through a webhook. Your workflow crashes. It tries to process the same bad data 50 times before you even notice. Your database is now full of duplicate garbage records.

This is what happens when you skip the production checklist. Your workflow works in testing, but it's not ready for production. Production means edge cases, malicious input, API failures, rate limits, concurrent executions, and a hundred other things you didn't test for.

Today I'm starting a two-part production deployment checklist: 26 items that separate hobby workflows from production-grade automation. This week covers the first 16 items across Error Handling, Data Validation, and Monitoring. Next week we'll cover Performance, Security, and Deployment.

If you're deploying workflows that other people depend on, this checklist is not optional.

The Checklist (Part 1 of 2)

This week covers 3 categories: Error Handling, Data Validation, and Monitoring & Logging.

Each item includes what it is, why it matters, and how to implement it.

1. Error Handling (6 Items)

1.1 Error Notifications

What it is: Send yourself an alert when the workflow fails.

Why it matters: If you don't know it failed, you can't fix it. Users will report failures before you discover them.

How to implement: Add an error trigger workflow that sends Slack/email alerts on any workflow failure. We covered this in Issue #1.

Test: Intentionally break a node. Verify you receive an alert within 1 minute.

1.2 Try/Catch Blocks on Critical Operations

What it is: Wrap risky operations (API calls, database writes) in error handlers so failures don't crash the workflow.

Why it matters: One failed API call shouldn't stop the entire workflow. Isolate failures.

How to implement: Use "Continue on Fail" on nodes that might fail, then check for errors in the next node with an IF condition.

Example:

API Call node (Continue on Fail: ON)
  |
IF node: Check if $json.error exists
  | (error exists)
  Log error, send alert, skip this record
  | (no error)
  Continue processing
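
The IF check above can also be written as a few lines of code. This is a minimal sketch in plain JavaScript, assuming a failed call leaves an error field on the item (the same assumption the IF node makes). splitByError is a hypothetical helper name, not an n8n built-in.

```javascript
// Hypothetical helper: separate items that carry an `error` field
// from items that succeeded, so one bad record does not stop the rest.
function splitByError(items) {
  const ok = [];
  const failed = [];
  for (const item of items) {
    // Assumption: a failed call leaves a truthy `error` on the item.
    if (item && item.error) {
      failed.push(item);
    } else {
      ok.push(item);
    }
  }
  return { ok, failed };
}

const { ok, failed } = splitByError([
  { id: 1, status: 200 },
  { id: 2, error: 'Request failed with status code 500' },
  { id: 3, status: 200 },
]);
console.log(ok.length, failed.length); // 2 1
```

The failed list is what you would log and alert on; the ok list continues through the rest of the workflow.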

Test: Send invalid data to the API. Verify workflow continues and logs the error.

1.3 Graceful Degradation

What it is: When a non-critical service fails, the workflow continues with reduced functionality instead of stopping.

Why it matters: If your analytics logging fails, the customer order should still process. Prioritize critical operations.

How to implement: Separate critical path (order processing) from nice-to-have (analytics tracking). Use error handlers to skip non-critical failures.

Example:

Process customer order (critical)
  |
Write to database (critical)
  |
Send confirmation email (critical)
  |
Log to analytics (non-critical, Continue on Fail)
  |
Update dashboard (non-critical, Continue on Fail)
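
The split between critical and non-critical steps can be sketched in plain JavaScript. This is an illustration under assumptions, not how n8n runs nodes internally: runWithDegradation and the step objects are hypothetical, and real steps would usually be asynchronous.

```javascript
// Run critical steps first (any failure aborts the run), then non-critical
// steps (failures are collected as warnings and do not stop the run).
function runWithDegradation(criticalSteps, optionalSteps) {
  for (const step of criticalSteps) {
    step.run(); // errors propagate on purpose: critical work must not fail silently
  }
  const warnings = [];
  for (const step of optionalSteps) {
    try {
      step.run();
    } catch (err) {
      warnings.push(`${step.name}: ${err.message}`);
    }
  }
  return { status: 'processed', warnings };
}

const result = runWithDegradation(
  [
    { name: 'process order', run: () => {} },
    { name: 'write to database', run: () => {} },
  ],
  [
    { name: 'log to analytics', run: () => { throw new Error('analytics API down'); } },
    { name: 'update dashboard', run: () => {} },
  ]
);
console.log(result.status, result.warnings.length); // processed 1
```

The order still completes, and the warning list tells you which nice-to-have step needs attention.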

Test: Disable analytics API. Verify order still processes successfully.

1.4 Retry Logic with Exponential Backoff

What it is: If an API call fails, retry it 3-5 times with increasing delays between attempts.

Why it matters: Network blips and temporary API outages are common. Retrying often succeeds.

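How to implement: Node settings -> Retry on Fail -> Max Tries: 3-5, Wait Between Tries: 2000ms. This setting waits the same interval before every retry; for delays that grow with each attempt, use the custom approach under Advanced.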

Advanced: Implement exponential backoff (wait 2s, then 4s, then 8s) for persistent failures.
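
Here is one way the 2s, 4s, 8s schedule could look in a code node. Treat it as a sketch: backoffDelay and retryWithBackoff are hypothetical names, the 30-second cap is my assumption, and the sleep function is injectable only so the logic can be tested without real waiting.

```javascript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ...
// capped at maxMs so a long outage does not produce huge waits.
function backoffDelay(attempt, baseMs = 2000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}

// Call `fn`, retrying on failure with growing delays between attempts.
async function retryWithBackoff(fn, { maxTries = 4, baseMs = 2000, sleep } = {}) {
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  let lastError;
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxTries) {
        await wait(backoffDelay(attempt, baseMs));
      }
    }
  }
  throw lastError; // every attempt failed
}

console.log([1, 2, 3].map((n) => backoffDelay(n))); // [ 2000, 4000, 8000 ]
```

With maxTries set to 4, that is one initial call plus three retries, matching the 2s, 4s, 8s waits above.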

Test: Temporarily disable an API. Verify node retries before failing.

1.5 Dead Letter Queue (Failed Items Storage)

What it is: When an item fails processing, store it in a separate location (database, Google Sheet, queue) for manual review instead of losing it.

Why it matters: Failed items contain valuable data. You need to review and reprocess them, not discard them.

How to implement: On error, write the failed item to a "failed_items" table or sheet with timestamp and error message.

Example:

Process item
  | (on error)
  Write to failed_items sheet: { item_id, data, error, timestamp }
  |
Send alert: "3 items failed processing, review failed_items sheet"
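
A sketch of building the failed_items row and the alert text, using the four fields from the example above. The helper names and the id field on the incoming item are assumptions for illustration.

```javascript
// Hypothetical helper: build one row for the failed_items sheet or table.
function toDeadLetter(item, error, now = new Date()) {
  return {
    item_id: item && item.id != null ? item.id : null, // assumes items carry an `id`
    data: JSON.stringify(item), // keep the full payload so it can be reprocessed
    error: error instanceof Error ? error.message : String(error),
    timestamp: now.toISOString(),
  };
}

// Alert text that reflects how many rows were written.
function alertMessage(rows) {
  const n = rows.length;
  return `${n} item${n === 1 ? '' : 's'} failed processing, review failed_items sheet`;
}

const rows = [
  toDeadLetter({ id: 'A-17', qty: 'three' }, new Error('Invalid quantity')),
];
console.log(rows[0].item_id, rows[0].error); // A-17 Invalid quantity
console.log(alertMessage(rows)); // 1 item failed processing, review failed_items sheet
```

Storing the original payload as a JSON string keeps the row flat enough for a sheet while still letting you replay the item later.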

Test: Send data that triggers a failure. Verify it's logged in the dead letter queue.

1.6 Circuit Breaker Pattern

What it is: If an API fails 5+ times in a row, stop calling it for 5-10 minutes instead of hammering it with requests.

Why it matters: Repeated failed calls waste resources, trigger rate limits, and delay recovery. Give the failing service time to recover.

How to implement: Track failure count in a database or variable. If failures exceed threshold, skip API calls for a cooldown period.

Advanced: Use n8n's built-in rate limiting or implement custom logic with timestamp checks.
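
The custom logic with timestamp checks could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and in a real workflow the breaker object would need to be loaded from and saved to a database or similar store between executions.

```javascript
// Minimal circuit breaker state: failure count plus the time it opened.
function createBreaker({ threshold = 5, cooldownMs = 5 * 60 * 1000 } = {}) {
  return { failures: 0, openedAt: null, threshold, cooldownMs };
}

// True when the API may be called: breaker closed, or cooldown elapsed.
function canCall(breaker, now = Date.now()) {
  if (breaker.openedAt === null) return true;
  return now - breaker.openedAt >= breaker.cooldownMs;
}

// Record the outcome of a call and open or close the breaker accordingly.
function recordResult(breaker, success, now = Date.now()) {
  if (success) {
    breaker.failures = 0;
    breaker.openedAt = null;
    return;
  }
  breaker.failures += 1;
  if (breaker.failures >= breaker.threshold) {
    breaker.openedAt = now; // start (or restart) the cooldown
  }
}

const breaker = createBreaker();
for (let i = 0; i < 5; i++) recordResult(breaker, false, 0);
console.log(canCall(breaker, 60000));  // false (1 minute into the cooldown)
console.log(canCall(breaker, 300000)); // true (5 minutes have passed)
```

If the first call after the cooldown fails again, the breaker reopens immediately, so a service that is still down gets another full cooldown instead of a fresh burst of requests.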

Test: Force 5 consecutive failures. Verify workflow stops calling the API and resumes after cooldown.

2. Data Validation (5 Items)

2.1 Input Validation on Webhooks

What it is: Validate all incoming webhook data before processing. Check data types, required fields, and format.

Why it matters: Malicious or malformed data will crash your workflow. Webhooks are public endpoints; anyone can send anything.

How to implement: Add a validation node immediately after webhook trigger. Check for required fields, validate email format, sanitize strings.

Example:

// Validation node
const required = ['email', 'name', 'order_id'];
for (const field of required) {
  if (!$json[field]) {
    throw new Error(`Missing required field: ${field}`);
  }
}
if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test($json.email)) {
  throw new Error('Invalid email format');
}
return $input.all();

Test: Send webhook with missing fields, invalid email, or malicious data. Verify workflow rejects it.

2.2 Sanitize User Input

What it is: Remove or escape special characters from user input to prevent injection attacks (SQL injection, script injection).

Why it matters: A user sending '; DROP TABLE users; -- in a name field could drop your users table if that value is concatenated into a SQL query without being sanitized.

How to implement: Use code nodes to sanitize strings before using them in queries or APIs.

Example:

function sanitize(str) {
  return str.replace(/[^\w\s@.-]/g, ''); // Allow only alphanumeric, spaces, @, ., -
}
$json.name = sanitize($json.name);
$json.email = sanitize($json.email);
return $input.all();

Test: Send input with special characters (script tags, SQL commands). Verify they're sanitized.

2.3 Null/Undefined Checks

What it is: Check if values exist before using them in operations.

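Why it matters: "Cannot read property 'name' of undefined" (newer runtimes word it as "Cannot read properties of undefined (reading 'name')") is one of the most common production errors. Check before accessing nested properties.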

How to implement: Use optional chaining or explicit null checks.

Example:

// Bad
const name = $json.customer.name; // Crashes if customer is undefined

// Good
const name = $json.customer?.name || 'Unknown';
// or
let name = 'Unknown'; // declared outside the if so it is usable afterwards
if ($json.customer && $json.customer.name) {
  name = $json.customer.name;
}

Test: Send data with missing nested fields. Verify workflow handles it gracefully.

2.4 Data Type Validation

What it is: Verify numbers are numbers, dates are dates, arrays are arrays.

Why it matters: APIs often return unexpected types. A string "123" is not a number: in JavaScript, "123" + 1 yields "1231" rather than 124, so math can silently produce wrong results.

How to implement: Check types before operations.

Example:

if (typeof $json.quantity !== 'number') {
  $json.quantity = parseInt($json.quantity, 10);
}
if (isNaN($json.quantity)) {
  throw new Error('Invalid quantity');
}

Test: Send string values for numeric fields. Verify workflow converts or rejects them.

2.5 Length and Size Limits

What it is: Enforce maximum lengths on strings, arrays, and file uploads.

Why it matters: A user uploading a 500MB file will crash your workflow. A description field with 100,000 characters will break your database.

How to implement: Check lengths before processing.

Example:

// Fall back to empty values so a missing field does not crash the check
if (($json.description ?? '').length > 1000) {
  throw new Error('Description too long (max 1000 characters)');
}
if (($json.items ?? []).length > 100) {
  throw new Error('Too many items (max 100)');
}

Test: Send oversized data. Verify workflow rejects it with clear error message.

3. Monitoring & Logging (5 Items)

3.1 Execution Logging

What it is: Log every workflow execution with timestamp, input data, and result (success/failure).

Why it matters: When something breaks, logs are how you debug. Without logs, you're guessing.

How to implement: Add a "Log Execution" node at the start and end of the workflow. Write to Google Sheet, database, or file.

Example:

// Log node
return [{
  workflow: $workflow.name,
  execution_id: $execution.id,
  timestamp: new Date().toISOString(),
  input: JSON.stringify($input.all()),
  status: 'started'
}];

Test: Run workflow 5 times. Verify all executions are logged.

3.2 Performance Metrics

What it is: Track how long each workflow takes to execute and how many items it processes.

Why it matters: Performance degradation is a warning sign. If a workflow that used to take 2 seconds now takes 30 seconds, something is wrong.

How to implement: Log execution start time and end time. Calculate duration.

Example:

// At end of workflow
// Assumes an earlier node named "Start" output an ISO timestamp field
const start = $node["Start"].json.timestamp;
const end = new Date().toISOString();
const duration = new Date(end) - new Date(start);
return [{
  workflow: $workflow.name,
  duration_ms: duration,
  items_processed: $input.all().length
}];

Test: Run workflow multiple times. Verify duration is logged and consistent.

3.3 Health Check Endpoint

What it is: A simple workflow that returns "OK" when triggered, proving n8n is running.

Why it matters: External monitoring tools (UptimeRobot, Pingdom) can check if your n8n instance is alive.

How to implement: Create a workflow with webhook trigger that returns { status: "OK", timestamp: now }.

Example:

Webhook trigger (GET /health)
  |
Respond to Webhook: { "status": "OK", "timestamp": "2026-01-07T10:00:00Z" }
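
The timestamp in the response above is an example value; in practice it should be generated on each request. A tiny sketch, where healthBody is a hypothetical helper name:

```javascript
// Hypothetical helper: build the JSON body the health workflow responds with.
function healthBody(now = new Date()) {
  return { status: 'OK', timestamp: now.toISOString() };
}

console.log(JSON.stringify(healthBody(new Date('2026-01-07T10:00:00Z'))));
// {"status":"OK","timestamp":"2026-01-07T10:00:00.000Z"}
```

A fresh timestamp lets the monitoring tool tell a live response from a cached one.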

Test: Call the endpoint. Verify it returns 200 OK.

3.4 Alert on Unusual Patterns

What it is: Send alerts when metrics deviate from normal (execution time 10x longer, failure rate above 5%, etc.).

Why it matters: Catches problems before they become disasters. A gradual increase in execution time indicates a growing dataset or performance issue.

How to implement: Track baseline metrics. Alert when current execution exceeds baseline by X%.

Example:

const baseline_duration = 2000; // 2 seconds
const current_duration = $json.duration_ms;
if (current_duration > baseline_duration * 3) {
  // Send alert: execution took 3x longer than normal
}
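
The failure-rate threshold mentioned above (5%) can be checked the same way. A minimal sketch, assuming each logged run has a status of 'success' or 'failure'; failureRateAlert is a hypothetical name.

```javascript
// Return an alert message when the failure rate is above the threshold,
// or null when everything looks normal.
function failureRateAlert(runs, maxFailureRate = 0.05) {
  if (runs.length === 0) return null; // nothing to judge yet
  const failed = runs.filter((run) => run.status === 'failure').length;
  const rate = failed / runs.length;
  if (rate <= maxFailureRate) return null;
  return `Failure rate ${(rate * 100).toFixed(1)}% (${failed}/${runs.length}) is above ${(maxFailureRate * 100).toFixed(1)}%`;
}

const recentRuns = [
  ...Array(18).fill({ status: 'success' }),
  { status: 'failure' },
  { status: 'failure' },
];
console.log(failureRateAlert(recentRuns));
// Failure rate 10.0% (2/20) is above 5.0%
```

Returning null for the normal case keeps the follow-up IF node simple: alert only when a message exists.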

Test: Artificially slow down workflow (add long wait). Verify alert fires.

3.5 Weekly Summary Reports

What it is: Send yourself a weekly email with workflow statistics: executions, failures, avg duration, most common errors.

Why it matters: Proactive monitoring. You spot trends before they become critical failures.

How to implement: Create a scheduled workflow (runs every Monday 8am) that queries execution logs and sends summary email.

Example:

Schedule Trigger (Monday 8am)
  |
Query execution logs (last 7 days)
  |
Aggregate: total runs, failures, avg duration, error breakdown
  |
Send email with summary
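
The aggregate step could be sketched like this. The log row shape (status, duration_ms, error, timestamp) is an assumption based on the logging examples earlier in this issue, and summarize is a hypothetical name.

```javascript
// Summarize execution log rows from the 7 days before `now`.
// Assumed row shape: { timestamp, status: 'success' | 'failure', duration_ms, error? }
function summarize(logs, now = new Date()) {
  const weekAgo = now.getTime() - 7 * 24 * 60 * 60 * 1000;
  const recent = logs.filter((row) => new Date(row.timestamp).getTime() >= weekAgo);
  const failures = recent.filter((row) => row.status === 'failure');
  const totalDuration = recent.reduce((sum, row) => sum + row.duration_ms, 0);
  const errorCounts = {};
  for (const row of failures) {
    const key = row.error || 'Unknown error';
    errorCounts[key] = (errorCounts[key] || 0) + 1;
  }
  return {
    total_runs: recent.length,
    failures: failures.length,
    avg_duration_ms: recent.length ? Math.round(totalDuration / recent.length) : 0,
    errors: errorCounts,
  };
}

const summary = summarize(
  [
    { timestamp: '2026-01-10T09:00:00Z', status: 'success', duration_ms: 2000 },
    { timestamp: '2026-01-11T09:00:00Z', status: 'failure', duration_ms: 4000, error: 'Rate limit exceeded' },
    { timestamp: '2025-12-30T09:00:00Z', status: 'success', duration_ms: 9000 }, // older than 7 days
  ],
  new Date('2026-01-12T08:00:00Z')
);
console.log(summary.total_runs, summary.failures, summary.avg_duration_ms); // 2 1 3000
```

The returned object maps directly onto the email body: total runs, failures, average duration, and the error breakdown.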

Test: Run manually. Verify email contains accurate statistics.

Part 1 Checklist Summary

Error Handling:

[ ] Error notifications configured
[ ] Try/catch blocks on critical operations
[ ] Graceful degradation for non-critical failures
[ ] Retry logic with exponential backoff
[ ] Dead letter queue for failed items
[ ] Circuit breaker pattern for repeated failures

Data Validation:

[ ] Input validation on webhooks
[ ] Sanitize user input
[ ] Null/undefined checks
[ ] Data type validation
[ ] Length and size limits

Monitoring & Logging:

[ ] Execution logging
[ ] Performance metrics tracking
[ ] Health check endpoint
[ ] Alerts on unusual patterns
[ ] Weekly summary reports

Quick Wins

Actions You Can Take This Week

🟢 Beginner • 20 min

Print This Checklist: Copy the checklist above. Print it. For your most critical workflow, go through each item and mark what you have vs. what's missing. You don't need to fix everything today; just know your gaps. Most people discover they're missing 10+ items from just these first 16.

🟡 Intermediate • 30 min

Add Error Notifications to All Production Workflows: If you haven't done this yet (Issue #1), do it now. This is the single most valuable item on the checklist. Every production workflow should send an alert when it fails. No exceptions.

🟡 Intermediate • 45 min

Implement Input Validation on Your Riskiest Webhook: Pick the webhook that processes the most critical data (orders, payments, user signups). Add a validation node that checks for required fields, validates data types, and sanitizes input. Test by sending malformed data and verifying it's rejected.

🔴 Advanced • 90 min

Build a Weekly Summary Report: Create a scheduled workflow that runs every Monday morning and emails you a summary of the past week: total executions per workflow, failure rates, average execution time, and top 5 errors. This gives you visibility into patterns and trends you'd otherwise miss.

Next Week

NodeBridge #5: The Production Checklist, Part 2 (10 More Items)

We covered Error Handling, Data Validation, and Monitoring. Next week we finish the checklist with Performance & Limits (rate limiting, batch processing, timeouts, idempotency), Security (webhook auth, environment variables, least privilege), and Deployment (version control, staging, documentation). Plus the complete 26-item printable checklist.

SOON: Watch the companion tutorials on YouTube (subscribe at youtube.com/@nodebridge_dev) where I'll walk through setting up execution logging, building a health check endpoint, and implementing retry logic with exponential backoff.

Got a production workflow that's missing items from this checklist?

Reply and tell me which items you're struggling to implement. I read every response and often create deep-dives or tutorials based on what readers need most.

Got a broken workflow that's driving you crazy?

Reply to this email and tell me about it. I read every response and plan to feature reader challenges in future issues.

Reply to This Email →

Connect With Us

💬 Follow our journey as we build
→ Connect on LinkedIn: NodeBridge Automation Solutions
→ Follow on X: @nodebridge_dev
→ Subscribe on YouTube: @nodebridge_dev

Bobby R. Goldsmith | Founder, NodeBridge Automation Solutions

P.S. Production-ready doesn't mean perfect. It means you've thought through failures, edge cases, and monitoring. Start with error notifications and input validation. Those two items alone will save you hours of debugging and prevent most disasters. The rest of the checklist can come gradually.
