Issue 4 • January 6, 2026

Runbooks and checklists are your shield against those 3am pings, Slacks, and texts

⏱️ 12 min read
You built a workflow. It's solid green. You deploy it. Stakeholders begin depending on it.

Two weeks later, it breaks at 3am. No error notification. You find out when the marketing manager messages you at 7am asking why the sales report didn't arrive. You scramble to fix it, but you don't know what broke because there are no logs. You fix the immediate issue.

Three days later, the workflow breaks again. Different error. This time it's a rate limit you didn't know existed. Your workflow made 1,000 API calls in 10 minutes. Your account is temporarily banned. You fix that.

A week later, a user sends malformed data through a webhook. Your workflow crashes. It tries to process the same bad data 50 times before you even notice. Your database is now full of duplicate garbage records.

This is what happens when you skip the production checklist. Your workflow works in testing, but it's not ready for production. Production means edge cases, malicious input, API failures, rate limits, concurrent executions, and a hundred other things you didn't test for.

Today I'm starting a two-part production deployment checklist: 26 items that separate hobby workflows from production-grade automation. This week covers the first 16 items across Error Handling, Data Validation, and Monitoring. Next week we'll cover Performance, Security, and Deployment.

If you're deploying workflows that other people depend on, this checklist is not optional.

The Checklist (Part 1 of 2)

This week covers 3 categories: Error Handling, Data Validation, and Monitoring & Logging. Each item includes what it is, why it matters, and how to implement it.

1. Error Handling (6 Items)

1.1 Error Notifications

What it is: Send yourself an alert when the workflow fails.

Why it matters: If you don't know it failed, you can't fix it. Users will report failures before you discover them.

How to implement: Add an error trigger workflow that sends Slack/email alerts on any workflow failure. We covered this in Issue #1.
Test: Intentionally break a node. Verify you receive an alert within 1 minute.

1.2 Try/Catch Blocks on Critical Operations

What it is: Wrap risky operations (API calls, database writes) in error handlers so failures don't crash the workflow.

Why it matters: One failed API call shouldn't stop the entire workflow. Isolate failures.

How to implement: Use "Continue on Fail" on nodes that might fail, then check for errors in the next node with an IF condition.

Example:
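Here is a minimal sketch of the "check for errors in the next node" step, written in plain JavaScript so it can be adapted to a Code node. The item shape and the `error` field are assumptions for illustration; check what your failing node actually outputs before relying on them.

```javascript
// Assumption: items look like { json: {...} }, and a node that failed with
// "Continue on Fail" enabled leaves an `error` field on item.json.
function splitByError(items) {
  const succeeded = [];
  const failed = [];
  for (const item of items) {
    const hasError = Boolean(item && item.json && item.json.error);
    if (hasError) {
      failed.push(item);
    } else {
      succeeded.push(item);
    }
  }
  return { succeeded, failed };
}

// Hypothetical output of an API node after one call failed.
const apiResults = [
  { json: { id: 1, status: "sent" } },
  { json: { id: 2, error: "Request failed with status code 500" } },
];

const outcome = splitByError(apiResults);
console.log(`${outcome.succeeded.length} succeeded, ${outcome.failed.length} failed`);
```

The succeeded branch carries on with the workflow, while the failed branch can feed a logging or alerting step.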
Test: Send invalid data to the API. Verify workflow continues and logs the error.

1.3 Graceful Degradation

What it is: When a non-critical service fails, the workflow continues with reduced functionality instead of stopping.

Why it matters: If your analytics logging fails, the customer order should still process. Prioritize critical operations.

How to implement: Separate critical path (order processing) from nice-to-have (analytics tracking). Use error handlers to skip non-critical failures.

Example:
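One way to picture the split between critical and nice-to-have steps, as a plain JavaScript sketch. The `saveOrder` and `trackAnalytics` functions are hypothetical stand-ins for your own nodes or API calls.

```javascript
// saveOrder is the critical step: if it throws, the whole run should fail.
// trackAnalytics is optional: if it throws, we note a warning and move on.
function processOrder(order, saveOrder, trackAnalytics) {
  const saved = saveOrder(order); // errors here propagate on purpose
  const warnings = [];
  try {
    trackAnalytics(saved);
  } catch (err) {
    warnings.push(`analytics skipped: ${err.message}`);
  }
  return { order: saved, warnings };
}

// Hypothetical run where the analytics service is down.
const degraded = processOrder(
  { id: 42, total: 99.5 },
  (order) => ({ ...order, saved: true }),
  () => {
    throw new Error("analytics API unavailable");
  }
);
console.log(degraded.order.saved, degraded.warnings);
```

The order is still saved, and the warning list gives you something to log so the skipped step doesn't go unnoticed.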
Test: Disable analytics API. Verify order still processes successfully.

1.4 Retry Logic with Exponential Backoff

What it is: If an API call fails, retry it 3-5 times with increasing delays between attempts.

Why it matters: Network blips and temporary API outages are common. Retrying often succeeds.

How to implement: Node settings -> Retry on Fail -> Max Tries: 3-5, Wait Between Tries: 2000ms. The built-in wait is a fixed interval, so increasing delays need custom logic.

Advanced: Implement exponential backoff (wait 2s, then 4s, then 8s) for persistent failures.

Test: Temporarily disable an API. Verify node retries before failing.

1.5 Dead Letter Queue (Failed Items Storage)

What it is: When an item fails processing, store it in a separate location (database, Google Sheet, queue) for manual review instead of losing it.

Why it matters: Failed items contain valuable data. You need to review and reprocess them, not discard them.

How to implement: On error, write the failed item to a "failed_items" table or sheet with timestamp and error message.

Example:
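A rough sketch of the dead letter idea in plain JavaScript. The in-memory `failedItems` array is a stand-in for your real "failed_items" table or sheet, and the record fields are assumptions you can rename freely.

```javascript
// Hypothetical in-memory store; in practice this would be a database table
// or a sheet.
const failedItems = [];

// Build a dead letter record with timestamp, error message, and original data.
function toDeadLetter(item, error, now = new Date()) {
  const record = {
    failedAt: now.toISOString(),
    errorMessage: error instanceof Error ? error.message : String(error),
    payload: JSON.stringify(item), // keep the original data for reprocessing
  };
  failedItems.push(record);
  return record;
}

// Process every item; a failing item is stored instead of being lost,
// and it does not stop the remaining items.
function processAll(items, handler) {
  const processed = [];
  for (const item of items) {
    try {
      processed.push(handler(item));
    } catch (err) {
      toDeadLetter(item, err);
    }
  }
  return processed;
}

const done = processAll([{ qty: 2 }, { qty: -1 }], (item) => {
  if (item.qty < 0) throw new Error("quantity must be positive");
  return { ...item, ok: true };
});
console.log(done.length, failedItems.length);
```

Storing the payload as a JSON string keeps the record flat, which makes it easy to write to a sheet row and to replay later.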
Test: Send data that triggers a failure. Verify it's logged in the dead letter queue.

1.6 Circuit Breaker Pattern

What it is: If an API fails 5+ times in a row, stop calling it for 5-10 minutes instead of hammering it with requests.

Why it matters: Repeated failed calls waste resources, trigger rate limits, and delay recovery. Give the failing service time to recover.

How to implement: Track failure count in a database or variable. If failures exceed threshold, skip API calls for a cooldown period.

Advanced: Use n8n's built-in rate limiting or implement custom logic with timestamp checks.

Test: Force 5 consecutive failures. Verify workflow stops calling the API and resumes after cooldown.

2. Data Validation (5 Items)

2.1 Input Validation on Webhooks

What it is: Validate all incoming webhook data before processing. Check data types, required fields, and format.

Why it matters: Malicious or malformed data will crash your workflow. Webhooks are public endpoints; anyone can send anything.

How to implement: Add a validation node immediately after webhook trigger. Check for required fields, validate email format, sanitize strings.

Example:
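A minimal validation sketch in plain JavaScript. The required field names (`name`, `email`) and the simple email pattern are assumptions for illustration; the regex is a sanity check, not a full address validator.

```javascript
// Loose email shape check: something@something.tld with no spaces.
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function validatePayload(body, requiredFields = ["name", "email"]) {
  if (body === null || typeof body !== "object" || Array.isArray(body)) {
    return { valid: false, errors: ["body must be a JSON object"] };
  }
  const errors = [];
  for (const field of requiredFields) {
    const value = body[field];
    if (typeof value !== "string" || value.trim() === "") {
      errors.push(`missing or empty field: ${field}`);
    }
  }
  const email = typeof body.email === "string" ? body.email.trim() : "";
  if (email !== "" && !EMAIL_RE.test(email)) {
    errors.push("invalid email format");
  }
  return { valid: errors.length === 0, errors };
}

console.log(validatePayload({ name: "Ada", email: "ada@example.com" }));
console.log(validatePayload({ name: "", email: "not-an-email" }));
```

Returning every error at once, instead of stopping at the first, makes the rejection message more useful to whoever is sending the webhook.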
Test: Send webhook with missing fields, invalid email, or malicious data. Verify workflow rejects it.

2.2 Sanitize User Input

What it is: Remove or escape special characters from user input to prevent injection attacks (SQL injection, script injection).

Why it matters: A user sending script tags or SQL commands in a field can inject code into your queries or pages if the input is passed along unescaped.

How to implement: Use code nodes to sanitize strings before using them in queries or APIs.

Example:
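A small sketch of string sanitizing in plain JavaScript. It covers HTML escaping and control characters only. For SQL, parameterized queries are the safer route than hand-escaping, so treat this as one layer and not the whole defense. The function names are illustrative.

```javascript
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Escape characters that have special meaning in HTML.
function escapeHtml(input) {
  return String(input).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Drop ASCII control characters except tab, newline, and carriage return.
function stripControlChars(input) {
  return String(input).replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "");
}

function sanitizeText(input) {
  return escapeHtml(stripControlChars(input).trim());
}

console.log(sanitizeText("<script>alert(1)</script>"));
// prints: &lt;script&gt;alert(1)&lt;/script&gt;
```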
Test: Send input with special characters (script tags, SQL commands). Verify they're sanitized.

2.3 Null/Undefined Checks

What it is: Check if values exist before using them in operations.

Why it matters: "Cannot read property 'name' of undefined" is one of the most common production errors. Check before accessing nested properties.

How to implement: Use optional chaining or explicit null checks.

Example:
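A short sketch of both approaches in plain JavaScript. The payload shape (`customer.name`, `order.items`) is hypothetical; swap in your own field names.

```javascript
// Optional chaining (?.) stops at the first missing level instead of throwing;
// nullish coalescing (??) supplies a fallback for null or undefined.
function customerName(payload) {
  return payload?.customer?.name ?? "Unknown";
}

function firstItemSku(payload) {
  return payload?.order?.items?.[0]?.sku ?? null;
}

// Equivalent explicit check, for when you want a clear error instead of a default.
function requireCustomerEmail(payload) {
  if (!payload || !payload.customer || typeof payload.customer.email !== "string") {
    throw new Error("customer.email is required");
  }
  return payload.customer.email;
}

console.log(customerName({ customer: { name: "Ada" } })); // prints: Ada
console.log(customerName({})); // prints: Unknown
console.log(firstItemSku({ order: { items: [] } })); // prints: null
```

Use the default-value style for optional fields and the throwing style for fields the rest of the workflow cannot run without.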
Test: Send data with missing nested fields. Verify workflow handles it gracefully.

2.4 Data Type Validation

What it is: Verify numbers are numbers, dates are dates, arrays are arrays.

Why it matters: APIs often return unexpected types. A string "123" is not a number; math operations will fail or silently produce wrong results (in JavaScript, "123" + 1 evaluates to "1231").

How to implement: Check types before operations.

Example:
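A sketch of convert-or-reject helpers in plain JavaScript. The helper names are illustrative, and returning `null` for bad values is one possible convention; throwing is equally reasonable.

```javascript
// Convert to a finite number, or return null if that is not possible.
function toNumber(value) {
  if (typeof value === "number") {
    return Number.isFinite(value) ? value : null;
  }
  if (typeof value === "string" && value.trim() !== "") {
    const parsed = Number(value);
    return Number.isFinite(parsed) ? parsed : null;
  }
  return null;
}

// Parse a date string, or return null. Only strings are accepted because
// new Date(null) quietly yields 1970-01-01 instead of an invalid date.
function toDate(value) {
  if (typeof value !== "string") return null;
  const date = new Date(value);
  return Number.isNaN(date.getTime()) ? null : date;
}

// Wrap non-arrays as an empty list so later loops don't throw.
function toArray(value) {
  return Array.isArray(value) ? value : [];
}

console.log("123" + 1); // prints: 1231 (string concatenation)
console.log(toNumber("123") + 1); // prints: 124
console.log(toNumber("12abc")); // prints: null
```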
Test: Send string values for numeric fields. Verify workflow converts or rejects them.

2.5 Length and Size Limits

What it is: Enforce maximum lengths on strings, arrays, and file uploads.

Why it matters: A user uploading a 500MB file will crash your workflow. A description field with 100,000 characters will break your database.

How to implement: Check lengths before processing.

Example:
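A sketch of a limits check in plain JavaScript. The specific limits and the field names (`description`, `tags`, `fileSizeBytes`) are made up for illustration; set them from your own database columns and plan limits.

```javascript
// Hypothetical limits; tune them to your own system.
const LIMITS = {
  maxDescriptionChars: 5000,
  maxTags: 50,
  maxFileBytes: 10 * 1024 * 1024, // 10 MB
};

function checkLimits(payload, limits = LIMITS) {
  const errors = [];
  const description = typeof payload.description === "string" ? payload.description : "";
  const tags = Array.isArray(payload.tags) ? payload.tags : [];
  const fileBytes = Number.isFinite(payload.fileSizeBytes) ? payload.fileSizeBytes : 0;

  if (description.length > limits.maxDescriptionChars) {
    errors.push(`description is ${description.length} chars (max ${limits.maxDescriptionChars})`);
  }
  if (tags.length > limits.maxTags) {
    errors.push(`tags has ${tags.length} entries (max ${limits.maxTags})`);
  }
  if (fileBytes > limits.maxFileBytes) {
    errors.push(`file is ${fileBytes} bytes (max ${limits.maxFileBytes})`);
  }
  return { ok: errors.length === 0, errors };
}

console.log(checkLimits({ description: "x".repeat(100000) }).errors);
```

Including the actual size and the limit in the message gives the sender the clear error the test below asks for.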
Test: Send oversized data. Verify workflow rejects it with clear error message.

3. Monitoring & Logging (5 Items)

3.1 Execution Logging

What it is: Log every workflow execution with timestamp, input data, and result (success/failure).

Why it matters: When something breaks, logs are how you debug. Without logs, you're guessing.

How to implement: Add a "Log Execution" node at the start and end of the workflow. Write to Google Sheet, database, or file.

Example:
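A sketch of what a log row could look like, in plain JavaScript. The field names and the 500-character input preview are assumptions; the preview cap is there so one huge payload doesn't bloat your sheet or table.

```javascript
const MAX_PREVIEW_CHARS = 500; // hypothetical cap on stored input size

function buildLogEntry({ workflow, status, input, error = null, now = new Date() }) {
  // JSON.stringify can return undefined (for example, for a function), so fall back.
  const serialized = JSON.stringify(input ?? null) ?? "null";
  return {
    timestamp: now.toISOString(),
    workflow,
    status, // "success" or "failure"
    inputPreview:
      serialized.length > MAX_PREVIEW_CHARS
        ? serialized.slice(0, MAX_PREVIEW_CHARS) + "..."
        : serialized,
    error,
  };
}

const entry = buildLogEntry({
  workflow: "sales-report",
  status: "failure",
  input: { reportDate: "2026-01-05" },
  error: "rate limit exceeded",
});
console.log(JSON.stringify(entry));
```

If the input can contain personal or payment data, consider logging only the fields you need for debugging.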
Test: Run workflow 5 times. Verify all executions are logged.

3.2 Performance Metrics

What it is: Track how long each workflow takes to execute and how many items it processes.

Why it matters: Performance degradation is a warning sign. If a workflow that used to take 2 seconds now takes 30 seconds, something is wrong.

How to implement: Log execution start time and end time. Calculate duration.

Example:
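A sketch of the start/end/duration calculation in plain JavaScript. In a real workflow the start and end timestamps would probably come from separate nodes; here a single hypothetical `measureRun` helper wraps the work to keep the example self-contained.

```javascript
// `clock` is injectable so the function can be tested with fake timestamps.
function measureRun(processItems, items, clock = Date.now) {
  const startedAt = clock();
  const output = processItems(items);
  const durationMs = clock() - startedAt;
  return {
    startedAt: new Date(startedAt).toISOString(),
    durationMs,
    itemCount: items.length,
    msPerItem: items.length > 0 ? durationMs / items.length : 0,
    output,
  };
}

const metrics = measureRun(
  (rows) => rows.map((row) => row * 2),
  [1, 2, 3, 4, 5]
);
console.log(`${metrics.itemCount} items in ${metrics.durationMs} ms`);
```

Logging time per item alongside total duration helps separate "the dataset grew" from "each item got slower".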
Test: Run workflow multiple times. Verify duration is logged and consistent.

3.3 Health Check Endpoint

What it is: A simple workflow that returns "OK" when triggered, proving n8n is running.

Why it matters: External monitoring tools (UptimeRobot, Pingdom) can check if your n8n instance is alive.

How to implement: Create a workflow with webhook trigger that returns "OK" with a 200 status.

Example:
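A sketch of the response a health check workflow might return, plus the check a monitor would apply to it, in plain JavaScript. The JSON shape (`status`, `checkedAt`) is an assumption; a bare "OK" body works just as well for most uptime tools.

```javascript
// Body the webhook workflow could respond with.
function healthResponse(now = new Date()) {
  return {
    statusCode: 200,
    body: { status: "OK", checkedAt: now.toISOString() },
  };
}

// What an external monitor effectively verifies.
function isHealthy(response) {
  return response.statusCode === 200 && response.body?.status === "OK";
}

const response = healthResponse();
console.log(isHealthy(response)); // prints: true
```

Including a timestamp in the body makes it easy to spot a cached or stale response when you inspect the endpoint by hand.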
Test: Call the endpoint. Verify it returns 200 OK.

3.4 Alert on Unusual Patterns

What it is: Send alerts when metrics deviate from normal (execution time 10x longer, failure rate above 5%, etc.).

Why it matters: Catches problems before they become disasters. A gradual increase in execution time indicates a growing dataset or performance issue.

How to implement: Track baseline metrics. Alert when current execution exceeds baseline by X%.

Example:
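A sketch of a baseline comparison in plain JavaScript, using the two thresholds mentioned above (10x duration, 5% failure rate) as defaults. The metric field names are hypothetical and should match whatever your execution log stores.

```javascript
function detectAnomalies(current, baseline, options = {}) {
  const { durationFactor = 10, maxFailureRate = 0.05 } = options;
  const alerts = [];

  // Duration check: only meaningful when a baseline exists.
  if (baseline.avgDurationMs > 0 && current.durationMs > baseline.avgDurationMs * durationFactor) {
    alerts.push(
      `duration ${current.durationMs} ms is more than ${durationFactor}x baseline (${baseline.avgDurationMs} ms)`
    );
  }

  // Failure rate check over the current window.
  const failureRate = current.executions > 0 ? current.failures / current.executions : 0;
  if (failureRate > maxFailureRate) {
    alerts.push(`failure rate ${(failureRate * 100).toFixed(1)}% is above ${maxFailureRate * 100}%`);
  }
  return alerts;
}

const alerts = detectAnomalies(
  { durationMs: 30000, executions: 100, failures: 2 },
  { avgDurationMs: 2000 }
);
console.log(alerts);
```

An empty array means nothing to report; anything else can be passed straight into your Slack or email alert.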
Test: Artificially slow down workflow (add long wait). Verify alert fires.

3.5 Weekly Summary Reports

What it is: Send yourself a weekly email with workflow statistics: executions, failures, avg duration, most common errors.

Why it matters: Proactive monitoring. You spot trends before they become critical failures.

How to implement: Create a scheduled workflow (runs every Monday 8am) that queries execution logs and sends summary email.

Example:
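A sketch of the aggregation step in plain JavaScript, assuming each log row has `status`, `durationMs`, and `error` fields (hypothetical names). Querying the logs and sending the email are left to your scheduled workflow.

```javascript
function summarize(entries, topN = 5) {
  const total = entries.length;
  const failed = entries.filter((entry) => entry.status === "failure");
  const totalDuration = entries.reduce((sum, entry) => sum + entry.durationMs, 0);

  // Count how often each error message appears.
  const errorCounts = new Map();
  for (const entry of failed) {
    const message = entry.error ?? "unknown error";
    errorCounts.set(message, (errorCounts.get(message) ?? 0) + 1);
  }
  const topErrors = [...errorCounts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([message, count]) => ({ message, count }));

  return {
    total,
    failures: failed.length,
    failureRate: total === 0 ? 0 : failed.length / total,
    avgDurationMs: total === 0 ? 0 : totalDuration / total,
    topErrors,
  };
}

const weekly = summarize([
  { status: "success", durationMs: 100, error: null },
  { status: "failure", durationMs: 200, error: "timeout" },
  { status: "success", durationMs: 300, error: null },
  { status: "failure", durationMs: 400, error: "timeout" },
]);
console.log(weekly);
```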
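Test: Run manually. Verify email contains accurate statistics.

Part 1 Checklist Summary

Error Handling:
1.1 Error Notifications
1.2 Try/Catch Blocks on Critical Operations
1.3 Graceful Degradation
1.4 Retry Logic with Exponential Backoff
1.5 Dead Letter Queue (Failed Items Storage)
1.6 Circuit Breaker Pattern

Data Validation:
2.1 Input Validation on Webhooks
2.2 Sanitize User Input
2.3 Null/Undefined Checks
2.4 Data Type Validation
2.5 Length and Size Limits

Monitoring & Logging:
3.1 Execution Logging
3.2 Performance Metrics
3.3 Health Check Endpoint
3.4 Alert on Unusual Patterns
3.5 Weekly Summary Reports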
Quick Wins

Actions You Can Take This Week

🟢 Beginner • 20 min
Print This Checklist: Copy the checklist above. Print it. For your most critical workflow, go through each item and mark what you have vs. what's missing. You don't need to fix everything today; just know your gaps. Most people discover they're missing 10+ items from just these first 16.

🟡 Intermediate • 30 min
Add Error Notifications to All Production Workflows: If you haven't done this yet (Issue #1), do it now. This is the single most valuable item on the checklist. Every production workflow should send an alert when it fails. No exceptions.

🟡 Intermediate • 45 min
Implement Input Validation on Your Riskiest Webhook: Pick the webhook that processes the most critical data (orders, payments, user signups). Add a validation node that checks for required fields, validates data types, and sanitizes input. Test by sending malformed data and verifying it's rejected.

🔴 Advanced • 90 min
Build a Weekly Summary Report: Create a scheduled workflow that runs every Monday morning and emails you a summary of the past week: total executions per workflow, failure rates, average execution time, and top 5 errors. This gives you visibility into patterns and trends you'd otherwise miss.

Next Week

NodeBridge #5: The Production Checklist, Part 2 (10 More Items)

We covered Error Handling, Data Validation, and Monitoring. Next week we finish the checklist with Performance & Limits (rate limiting, batch processing, timeouts, idempotency), Security (webhook auth, environment variables, least privilege), and Deployment (version control, staging, documentation). Plus the complete 26-item printable checklist.

SOON: Watch the companion tutorials on YouTube (subscribe at youtube.com/@nodebridge_dev) where I'll walk through setting up execution logging, building a health check endpoint, and implementing retry logic with exponential backoff.

Got a production workflow that's missing items from this checklist?
Reply and tell me which items you're struggling to implement. I read every response and often create deep-dives or tutorials based on what readers need most.

Got a broken workflow that's driving you crazy? Reply to this email and tell me about it. I plan to feature reader challenges in future issues.

Bobby R. Goldsmith | Founder, NodeBridge Automation Solutions

P.S. Production-ready doesn't mean perfect. It means you've thought through failures, edge cases, and monitoring. Start with error notifications and input validation. Those two items alone will save you hours of debugging and prevent most disasters. The rest of the checklist can come gradually.
