NodeBridge Automation Solutions

Issue 3 • December 30, 2025

Off to the Race Conditions: How to Debug Timing Issues in Production n8n Workflows

⏱️ 14 min read

In This Issue

  • Fail Pattern: The Test Data Trap
  • Why Manual Testing Lies to You
  • Solution: Wait Nodes and Retry Logic
  • How to Debug Race Conditions
  • Coming in Future Issues

You test the workflow. Click "Execute." It runs perfectly. Every node succeeds. Data flows exactly as expected. You deploy to production with confidence.

Three hours later, it fails. Same workflow, same data, same configuration. But now it's returning empty results, skipping records, or crashing midway through.

You test it again manually. Works perfectly. You're baffled.

This might be a race condition. When timing matters but you didn't know it mattered. When your workflow depends on something finishing before the next step begins, but in production, it doesn't wait.

Today, I'm going to show you why workflows that pass manual testing fail in production, how to identify race conditions before they break you, and the exact strategies to fix timing-dependent failures.

Fail Pattern: The Test Data Trap

What Happens:

You build a workflow that fetches data from an API, processes it, and writes results to a database. You test it with your test account. Click "Execute." All green, all good. It works flawlessly.

You deploy to production and set it to run on webhook calls and on an hourly schedule. The first few executions work. Then one fails. Then another. Soon, half your executions are failing with "Cannot read property of undefined" or "No items returned."

You check the logs. The API node succeeded. The processing node succeeded. But somewhere between them, the data went missing.

Why It Happens:

APIs are asynchronous. When you call an API, it doesn't always return data instantly. Sometimes it returns a job ID and says "check back in a few seconds." Sometimes it starts processing and streams results. Sometimes it queues your request and fulfills it when resources are available.

In manual testing, you don't notice this delay. You click Execute, the node waits for the API to respond, and by the time you look at the output, the data is there. The workflow appears to work.

But in production, when workflows run automatically, n8n moves to the next node as soon as the API responds, even if that response is "processing, check back later." The next node expects data. It gets nothing. It fails.

This is the test data trap. Your test data is small, simple, and returns instantly. Production data is large, complex, and takes time to process. The workflow that worked in testing fails in production because of timing you never saw.

Which Operations Cause This:

Almost any external API can have timing issues:

  • Webhooks that trigger on external events: The webhook fires before the data is ready
  • Database queries on large datasets: Query starts instantly but results take seconds to return
  • File processing APIs: Upload succeeds but processing happens asynchronously
  • Report generation APIs: Request accepted but report builds in background
  • AI/ML APIs: Request queued, processing happens later
  • Bulk operations: API accepts batch but processes items over time
  • Third-party integrations: External system responds before data is committed

The pattern is always the same: the API says "okay" but the work isn't done yet.
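One way to catch this early is to check what the API actually said before passing data along. Here's a minimal sketch, assuming a hypothetical response shape with a `status` field and a `result` payload. Your API's field names and status values will almost certainly differ.

```javascript
// Hypothetical response shapes (real APIs name these fields differently):
//   accepted but unfinished: { status: "processing", job_id: "abc123" }
//   finished:                { status: "complete", result: { url: "https://..." } }

// Statuses that mean "request accepted, work still running" (assumed values).
const PENDING_STATUSES = new Set(["queued", "processing", "pending"]);

// Classify a response as "ready", "pending", or "invalid".
function classifyResponse(response) {
  if (!response || typeof response !== "object") return "invalid";
  if (PENDING_STATUSES.has(response.status)) return "pending";
  if (response.status === "complete" && response.result != null) return "ready";
  return "invalid";
}

console.log(classifyResponse({ status: "processing", job_id: "abc123" }));
console.log(classifyResponse({ status: "complete", result: { url: "https://example.com/file.pdf" } }));
```

In an n8n Code node, the same check could run against each item's `json`, with anything other than "ready" routed to a wait or retry branch instead of the next processing step.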

Real-World Disaster:

You built a workflow that processes customer invoices. When a new invoice arrives via webhook, the workflow:

1. Fetches customer data from your CRM

2. Calculates taxes and discounts

3. Generates a PDF invoice

4. Emails it to the customer

5. Marks the invoice as SENT in your database

In testing with 5 invoices, this works perfectly. You deploy to production.

On Monday morning, 50 invoices arrive within 10 minutes. Your workflow processes them all. But 12 customers email support saying they got blank PDFs.

You investigate. The PDF generation API (step 3) accepted all 50 requests instantly. But it queues requests and processes them over 2-3 minutes. Your workflow didn't wait. It tried to fetch the PDF URL immediately, got a "processing" status, and emailed the customer a blank file.

Manual testing never caught this because you tested with 1-5 invoices that processed instantly. Production load exposed the race condition.

Or worse:

You have a workflow that syncs product inventory from your warehouse system to your e-commerce site. It runs about every 5 minutes: whenever inventory changes, the warehouse system sends a webhook to n8n, which:

1. Fetches updated inventory counts from warehouse API

2. Updates product quantities in Shopify

3. Logs the sync in a Google Sheet

In testing, this works. In production, you notice something strange: inventory counts are sometimes off by 1-2 units. Not always, just occasionally.

The race condition: the warehouse API updates its database asynchronously. When the webhook fires, it sends the inventory change event before the database update completes. Your workflow fetches inventory immediately and gets the old count. By the time it writes to Shopify, the data is stale.

Customers see "In Stock" when you're actually sold out. They order. You can't fulfill. Refunds, complaints, lost trust.

The Pattern:

1. Build workflow, test manually, works perfectly

2. Deploy to production with real data or real load

3. Race condition appears (async operation, delayed data, processing queue)

4. Workflow fails intermittently or returns incorrect data

5. Manual testing still works (small data, instant responses)

6. You're stuck debugging something you can't reproduce

Why Manual Testing Lies to You

Manual testing creates false confidence. Here's why:

1. You're Testing with Ideal Data

Your test account has 10 records. Production has 10,000. Your test API request returns in 50ms. Production requests take 2 seconds because the database is under load.

When you manually test, you see the best-case scenario. Everything is fast. Everything is small. Everything returns instantly.

Production is messy. Large datasets. Concurrent requests. API rate limits. Database locks. Network latency. Cache misses. Background jobs queued.

Your workflow works in ideal conditions but breaks under real conditions.

2. You're Waiting Without Realizing It

When you click "Execute" and watch nodes turn green, you're waiting. You see the first node succeed. Then the second. Then the third.

That 2-second pause between nodes? That's you waiting for the screen to update, or clicking through the nodes one at a time. Meanwhile, the API keeps processing in the background. By the time you look at the output, the async operation has finished.

In production, there's no pause. The workflow executes at full speed. Node 1 finishes. Node 2 starts immediately. If Node 1 triggered an async operation that takes 3 seconds, Node 2 doesn't have the data yet. It fails.

Manual testing hides this because you're slow. Automation is fast.

3. You're Testing Once, Not Repeatedly

You test the workflow once. It works. You deploy.

But race conditions are probabilistic. They don't fail every time. They fail when:

  • The API is under load and responds slower than usual
  • Two workflows run simultaneously and conflict
  • A background job hasn't finished yet
  • A cache expires mid-execution
  • Network latency spikes

Manual testing runs once under ideal conditions. Production runs hundreds of times under varying conditions. The race condition may only appear 10% of the time, but that's enough to break production.
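To see why a single passing test tells you so little, here's the arithmetic as a quick sketch. It assumes each run fails independently with the same probability, which is a simplification: real failures tend to cluster when the API is under load.

```javascript
// Chance that at least one of `runs` executions fails, assuming each run
// fails independently with probability `failureRate` (a simplification).
function chanceOfAtLeastOneFailure(failureRate, runs) {
  return 1 - Math.pow(1 - failureRate, runs);
}

const hourlyRunsPerDay = 24; // an hourly schedule
const perDay = chanceOfAtLeastOneFailure(0.10, hourlyRunsPerDay);
console.log(`Chance of at least one failure per day: ${(perDay * 100).toFixed(0)}%`);
```

With a 10% failure rate, one manual test passes nine times out of ten. On an hourly schedule, going a full day without a failure is the unlikely outcome.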

4. You're Not Testing Edge Cases

What happens if:

  • The API returns data in a different order?
  • A record is null or empty?
  • The database is locked by another process?
  • The file hasn't finished uploading?
  • The external service hasn't committed the transaction?

Manual testing with clean test data never hits these cases. Production does. Daily.

The Hard Truth:

If your workflow only works in manual testing, it's not production-ready. Production-grade workflows handle timing, concurrency, retries, partial failures, and edge cases you didn't think to test.

Solution: Wait Nodes and Retry Logic

Here's how to fix race conditions like these:

Strategy 1: Add Explicit Wait Nodes

If you know an operation takes time, add a Wait node after it.

Example: PDF Generation

Request PDF generation (returns job_id)
  ↓
Wait node (30 seconds)
  ↓
Fetch PDF using job_id
  ↓
Email PDF to customer

The Wait node gives the PDF service time to finish processing. By the time you fetch the PDF, it's ready.

When to use this:

  • You know the operation is async
  • You know roughly how long it takes
  • The delay is consistent

Strategy 2: Poll Until Ready

If you don't know how long the operation takes, poll the API until it's done.

Example: Report Generation

Request report (returns job_id)
  ↓
Check status (using job_id)
  ↓
  IF status = "processing": Wait 5 seconds, loop back to Check status
  IF status = "complete": Exit loop
  ↓
Download report

This keeps checking until the report is ready, then proceeds.

When to use this:

  • Operation time varies (small reports: 10 seconds, large reports: 5 minutes)
  • API provides a status endpoint
  • You need reliability over speed
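The decision at the heart of the polling loop can be sketched as one small function. The status values ("complete", "failed") and the 60-attempt limit are assumptions for illustration. At 5 seconds per check, 60 attempts covers roughly the 5-minute worst case mentioned above.

```javascript
// Decide what the polling loop should do next.
// Status values and the default attempt limit are assumptions.
function nextPollAction(status, attempt, maxAttempts = 60) {
  if (status === "complete") return "fetch";     // exit the loop and download
  if (status === "failed") return "abort";       // the job itself failed; stop polling
  if (attempt >= maxAttempts) return "timeout";  // give up instead of looping forever
  return "wait";                                 // wait, then check again
}

// Simulate three status checks: two "processing", then "complete".
const statuses = ["processing", "processing", "complete"];
let attempt = 0;
let action;
do {
  action = nextPollAction(statuses[attempt], attempt + 1);
  attempt += 1;
} while (action === "wait");
console.log(action, "after", attempt, "checks");
```

The attempt limit is the important part. Without it, a job that never finishes keeps the loop running indefinitely. In n8n, a function like this could live in a Code node, with an IF or Switch node routing "wait" back through a Wait node.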

Strategy 3: Use Webhooks for Completion

Some APIs support callback webhooks. Instead of polling, you tell the API "call this webhook when done."

Example: Video Transcoding

Upload video (provide webhook URL)
  ↓
[Wait for webhook to fire]
  ↓
Webhook receives completion event
  ↓
Download transcoded video

This is the most efficient method. No polling, no guessing. The API tells you when it's ready.

When to use this:

  • API supports completion webhooks
  • You can expose a webhook endpoint
  • Operation takes a long time (minutes or hours)
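In n8n, this pattern is typically built with a Wait node set to resume on a webhook call, with the expression `$execution.resumeUrl` supplying the URL that resumes the paused execution. Here's a rough sketch of the two pieces of logic involved. The field names (`source_url`, `callback_url`, `status`, `output_url`) are hypothetical, so check your provider's documentation for the real ones.

```javascript
// Build the request body for an async transcoding job.
// `source_url` and `callback_url` are hypothetical field names.
function buildTranscodeRequest(videoUrl, resumeUrl) {
  if (!/^https?:\/\//.test(resumeUrl)) {
    throw new Error("A reachable http(s) callback URL is required");
  }
  return { source_url: videoUrl, callback_url: resumeUrl };
}

// Check the completion event before downloading anything.
// The `status` and `output_url` fields are also assumptions.
function isCompletionEvent(event) {
  return Boolean(event) && event.status === "complete" && typeof event.output_url === "string";
}

console.log(buildTranscodeRequest("https://example.com/raw.mp4", "https://n8n.example.com/webhook-waiting/123"));
```

Validating the completion event matters because some services also call the webhook for failures or progress updates. Treating every callback as "done" would reintroduce the original problem.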

Strategy 4: Retry on Failure

If a node fails because data isn't ready, retry it a few times with delays between attempts.

Example: Database Read After Write

Write data to database
  ↓
Read data back (might fail if write hasn't committed)
  ↓
[Retry Settings: 3 attempts, 2 second delay]

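n8n's Retry On Fail setting (with Max Tries and Wait Between Tries) lets a node retry automatically when it errors. By the time the third attempt runs, the database write has hopefully committed. Note that the retry only triggers on an error: a read that succeeds but returns nothing won't be retried.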

When to use this:

  • You're dealing with eventual consistency
  • Failures are transient (network blips, database locks)
  • You don't control the API (can't add webhooks or status checks)
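When you need the same behavior inside a Code node or a custom script, a retry helper is short to write. This is a minimal sketch, not n8n's internal implementation. The defaults mirror the 3 attempts and 2-second delay from the example above, and `makeFlakyRead` is a hypothetical stand-in for a read that can't see the row yet.

```javascript
// Default sleep; injectable so the helper can be tested without real waiting.
const realSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async operation a fixed number of times with a fixed delay.
async function retryWithDelay(operation, { attempts = 3, delayMs = 2000, sleep = realSleep } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < attempts) await sleep(delayMs); // no delay after the final attempt
    }
  }
  throw lastError; // every attempt failed; surface the last error
}

// Hypothetical read that treats "row not visible yet" as an error,
// so a missing row triggers a retry instead of passing silently.
function makeFlakyRead(failuresBeforeSuccess) {
  let calls = 0;
  return async () => {
    calls += 1;
    if (calls <= failuresBeforeSuccess) throw new Error("Row not found yet");
    return { id: 42, status: "SENT" };
  };
}

// Usage: retryWithDelay(makeFlakyRead(2)).then((row) => console.log(row));
```

The detail worth copying is in `makeFlakyRead`: the empty result is turned into an error. A retry mechanism can only help if "data not ready" actually counts as a failure.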

Strategy 5: Depend on Timestamps, Not Assumptions

Instead of assuming data arrived, check timestamps to verify.

Example: Inventory Sync

Fetch inventory update event
  ↓
Check event timestamp
  ↓
Wait until timestamp + 5 seconds (ensure DB commit finished)
  ↓
Fetch inventory from API
  ↓
Update Shopify

This makes it far less likely that you fetch data before it's committed.

When to use this:

  • You're dealing with event-driven systems
  • You know the commit delay
  • Data integrity is critical
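The "wait until timestamp + 5 seconds" step comes down to one subtraction. Here's a sketch, assuming the event carries an ISO 8601 timestamp. The 5-second commit delay is the figure from the example above and should be replaced with whatever you measure for your own source system.

```javascript
// How long to wait before fetching, given when the event was emitted.
// `commitDelayMs` is the commit lag you have measured for the source system;
// the 5000 ms default is an assumption taken from the example.
function msUntilSafeToFetch(eventTimestamp, commitDelayMs = 5000, now = Date.now()) {
  const eventTime = new Date(eventTimestamp).getTime();
  if (Number.isNaN(eventTime)) {
    throw new Error(`Invalid event timestamp: ${eventTimestamp}`);
  }
  // Never return a negative wait: old events can be fetched right away.
  return Math.max(0, eventTime + commitDelayMs - now);
}

const now = Date.parse("2025-12-30T10:00:02Z");
console.log(msUntilSafeToFetch("2025-12-30T10:00:00Z", 5000, now));
```

Compared with a fixed wait, this avoids adding a full 5 seconds to events that were already delayed in transit.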

How to Debug Race Conditions

If you suspect a race condition, here's how to confirm and fix it:

Step 1: Add Logging Everywhere

Add Set nodes or Code nodes that log timestamps and data at each step.

// Code node after API call
console.log('Timestamp:', new Date().toISOString());
console.log('API Response:', JSON.stringify($input.all()));
return $input.all();

Run the workflow in production. Check logs. If you see:

  • Empty data where there should be data
  • Timestamps that are too close together
  • Errors that say "undefined" or "null"

You likely have a race condition.
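If you collect those log lines somewhere structured, the first two warning signs can be checked automatically. This is a sketch that assumes each log entry looks like `{ step, timestamp, itemCount }`. That shape and the 1-second minimum gap are both assumptions, so adjust them to your own logging.

```javascript
// Flag log entries that show empty data or a suspiciously short gap
// since the previous step.
// Assumed entry shape: { step, timestamp (ISO string), itemCount }.
function findSuspiciousSteps(entries, minGapMs = 1000) {
  const findings = [];
  entries.forEach((entry, i) => {
    if (entry.itemCount === 0) {
      findings.push(`${entry.step}: no data`);
    }
    if (i > 0) {
      const previous = entries[i - 1];
      const gap = Date.parse(entry.timestamp) - Date.parse(previous.timestamp);
      if (gap < minGapMs) {
        findings.push(`${entry.step}: only ${gap} ms after ${previous.step}`);
      }
    }
  });
  return findings;
}

const findings = findSuspiciousSteps([
  { step: "Request PDF", timestamp: "2025-12-30T10:00:00.000Z", itemCount: 1 },
  { step: "Fetch PDF", timestamp: "2025-12-30T10:00:00.120Z", itemCount: 0 },
]);
console.log(findings);
```

A short gap on its own proves nothing, since plenty of steps are legitimately fast. It becomes a strong signal when the step before it kicked off work you know takes seconds.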

Step 2: Add Wait Nodes to Test

Insert a Wait node (10-30 seconds) after the suspected async operation. Run the workflow again. If it succeeds, you found the race condition.

Step 3: Determine the Right Wait Time

Don't guess. Test with production data. If small datasets take 2 seconds and large datasets take 10 seconds, set your wait to 15 seconds (with buffer).

Or switch to polling (check status every 5 seconds until ready).
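The buffer calculation is simple enough to write down. This sketch takes the durations you measured and applies a safety factor. The 1.5x default is an assumption chosen because it reproduces the 10-second-to-15-second example above.

```javascript
// Suggest a fixed wait: slowest observed duration times a safety buffer.
// The 1.5x default buffer is an assumption, not a universal rule.
function recommendWaitSeconds(observedSeconds, bufferFactor = 1.5) {
  if (!Array.isArray(observedSeconds) || observedSeconds.length === 0) {
    throw new Error("Need at least one measured duration");
  }
  return Math.ceil(Math.max(...observedSeconds) * bufferFactor);
}

console.log(recommendWaitSeconds([2, 3, 10]));
```

The function is based on the slowest run, not the average. A wait tuned to the average fails on exactly the large datasets that caused the problem in the first place.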

Step 4: Test Under Load

Trigger the workflow 10 times in quick succession. If it fails under concurrent load, you have a concurrency issue (database locks, API rate limits, shared resources).

Fix by:

  • Adding queue logic (process one at a time)
  • Using unique identifiers (avoid conflicts)
  • Adding retry logic (handle transient failures)
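For the "unique identifiers" fix, here's a sketch of separating first occurrences from duplicates within a batch. The `invoice_id` key is hypothetical. Items with no ID are set aside for review instead of being silently dropped.

```javascript
// Split items into first occurrences and rejects (duplicates or missing IDs).
// `invoice_id` is a hypothetical key; use whatever uniquely identifies a record.
function splitDuplicates(items, key = "invoice_id") {
  const seen = new Set();
  const unique = [];
  const rejected = [];
  for (const item of items) {
    const id = item == null ? undefined : item[key];
    if (id == null || seen.has(id)) {
      rejected.push(item); // duplicate, or no identifier to check against
    } else {
      seen.add(id);
      unique.push(item);
    }
  }
  return { unique, rejected };
}

const batch = [{ invoice_id: "A1" }, { invoice_id: "A2" }, { invoice_id: "A1" }, { amount: 10 }];
const { unique, rejected } = splitDuplicates(batch);
console.log(unique.length, "unique,", rejected.length, "rejected");
```

This only protects a single execution. When two executions run at the same time, the check has to live somewhere shared, such as a unique constraint in the database.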

Step 5: Test with Edge Case Data

Test with data that looks like production, including the awkward cases:

  • Large datasets (1000+ records)
  • Empty datasets (0 records)
  • Null values
  • Special characters
  • Concurrent requests

If it fails with any of these, add error handling and validation before the operation that fails.
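Validation before the risky step might look something like this. It's a sketch with hypothetical required fields (`id` and `email`), to be swapped for whatever the downstream operation actually needs.

```javascript
// Split records into valid and invalid before the risky operation runs.
// The default required field names are hypothetical.
function validateRecords(records, requiredFields = ["id", "email"]) {
  const valid = [];
  const invalid = [];
  for (const record of records ?? []) {
    // A field counts as missing if it is absent, null, or an empty string.
    const missing = requiredFields.filter(
      (field) => record == null || record[field] == null || record[field] === ""
    );
    if (missing.length === 0) valid.push(record);
    else invalid.push({ record, missing });
  }
  return { valid, invalid, isEmpty: valid.length === 0 };
}

const checked = validateRecords([
  { id: 1, email: "a@example.com" },
  { id: 2, email: "" },
  null,
]);
console.log(checked.valid.length, "valid,", checked.invalid.length, "invalid");
```

Returning the invalid records along with the missing field names gives your error branch something concrete to log, which beats a bare "undefined" three nodes later.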

Quick Wins

Actions You Can Take This Week

🟢 Beginner • 15 min

Audit Your Workflows for Async Operations: Open each production workflow. Identify any node that calls an external API, generates a file, or triggers a background job. Ask: "Does this operation return instantly, or does it process asynchronously?" If async, flag it for testing. You don't need to fix anything yet; just know where the risk is.

🟡 Intermediate • 25 min

Add a Wait Node to Your Riskiest Workflow: Pick the workflow most likely to have a race condition (file processing, report generation, bulk operations). Add a Wait node (15-30 seconds) after the suspected async operation. Deploy and monitor. If failures stop, you found the issue. Now you can tune the wait time or switch to polling.

🟡 Intermediate • 30 min

Build a Polling Loop for One Async Operation: Pick an API that returns a job ID and requires status checking (PDF generation, video processing, etc.). Build a loop that checks status every 5 seconds and exits when complete. Test with production data. This pattern will become your go-to for async operations.

🔴 Advanced • 45 min

Add Execution Logging to All Production Workflows: Create a "Log Execution" subworkflow that writes timestamps, node outputs, and errors to a Google Sheet or database. Call it at critical points in your workflows. When a race condition happens, you'll have detailed logs to diagnose exactly where timing failed. This is how you debug production issues you can't reproduce locally.

Next Week

NodeBridge #4: The Production Checklist (26 Items You're Probably Missing)

Before you deploy any workflow to production, there's a checklist. Error handling, data validation, rate limiting, logging, monitoring, idempotency, retries, alerting, and 18 other things that separate hobby workflows from production-grade automation. Next week, I'll give you the complete checklist and explain why each item matters.

SOON: Watch the companion tutorial on YouTube (subscribe at youtube.com/@nodebridge_dev) where I'll walk through building a polling loop, adding retry logic, and debugging a real race condition step by step.

Got a workflow that works in testing but fails in production?

Reply to this email and describe what's happening. I read every response and often feature reader challenges in future issues. If it's a race condition, I'll help you find it.

Reply to This Email →

Bobby R. Goldsmith | Founder, NodeBridge Automation Solutions

P.S. If you've ever said "but it worked when I tested it," I've been there. We've all been there. The solution isn't better testing. It's designing workflows that handle timing, retries, and async operations from the start. That's what we're building toward.

Coming in Future Issues

Issue 5: How to Calculate ROI on Automation

Your manager asks "Was this worth it?" and you freeze. Learn the framework for measuring automation value: time saved, error reduction, opportunity cost, and how to present ROI in terms executives actually care about.

Issue 6: Advanced Error Handling (The 3-Tier System)

We covered basic error notifications in Issue 1. Now we're going deeper: retry logic, circuit breakers, dead letter queues, and how to build workflows that recover from failures automatically.

Issue 7: The Loop Explosion (When Workflows Won't Stop)

You built a loop. It's supposed to process 100 items. It's been running for 6 hours and you're at item 47,832. Something is very, very wrong. How to prevent infinite loops and runaway workflows.

Issue 8: When Your Automation Becomes Your Job

You automated 10 workflows. Now you spend 15 hours a week babysitting them. This is not success. This is a different kind of manual labor. How to build workflows that don't need constant maintenance.
