Transaction Enrichment · September 19, 2025

Restoring the Transaction Enrichment Pipeline

Three war-room sessions turned partial API payloads in the CR environment into a repeatable incident playbook. Here is the executive recap, the operating model for remediation, and the questions we still need to answer before the next sprint deadline.

Christopher Manuel Cruz-Guzman

Executive Summary

  • CR-only failure: Transaction enrichment dropped to HTTP 204/206 because ETU stopped populating enrichedMerchantDetail_v3 after upstream payloads shipped without mandatory merchant geography.
  • Two root causes: A failed UAT SOD batch blocked posted transactions, and CDP test data omitted required fields for pending transactions, triggering valid enrichment rejections.
  • Recovery program: Manual batch reruns, conditioned data requests, and enhanced observability restored parity with Production and anchored new triage rituals ahead of the September 30 stability checkpoint.

Key Insights & Signals

The three sessions surfaced patterns that apply far beyond this incident. Use these insights as leading indicators the next time enrichment coverage slips.

Observe the Nulls

Adding enrichedMerchantDetail_v3 to the Splunk dashboard made null payloads undeniable and accelerated the trace from ETU to CDP.
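
A minimal sketch of that check outside Splunk: scan a batch of exported API responses and list the trace IDs whose enrichment block is missing or null. The traceId field name and the sample records are assumptions; only enrichedMerchantDetail_v3 comes from the incident.

```python
# Minimal sketch: scan exported API responses for missing or null
# enrichedMerchantDetail_v3 blocks so the gap is visible per trace ID.
# The traceId field name and the sample records below are assumptions.

def find_null_enrichment(responses):
    """Return the trace IDs whose enrichment block is missing, null, or empty."""
    return [
        resp.get("traceId", "<unknown>")
        for resp in responses
        if not resp.get("enrichedMerchantDetail_v3")
    ]

sample = [
    {"traceId": "abc-123", "enrichedMerchantDetail_v3": {"merchantState": "OH"}},
    {"traceId": "def-456", "enrichedMerchantDetail_v3": None},  # the CR symptom
]
for trace_id in find_null_enrichment(sample):
    print(f"null enrichment payload for trace {trace_id}")
```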

Batch Jobs Bite Back

The UAT SOD job failed silently, leaving posted transactions unenriched until a manual rerun restored the data. Batch reliability is now a gating item for go-live.

Data Conditioning is Product Work

Every pending transaction must ship with city, state, and zip. Conditioning is now a CDP OKR, not a courtesy.

Detailed Breakdown

Each meeting compounded our understanding. Together they form the timeline for any retro or audit.

Meeting 1 — Instrument the Gap

We confirmed a CR-specific defect: production payloads looked healthy, but CR calls returned partial or empty data. Dan Robinson's multi-source dashboard followed trace IDs across TPS, ETU, and Splunk, exposing null enrichedMerchantDetail_v3 payloads for both pending and posted states.

Diagnostic Flow

  1. Compare Prod vs. CR responses using Chris Cruz's golden JSON.
  2. Query the dashboard with the failing trace ID to locate ETU events.
  3. Validate pending vs. posted status to understand enrichment pathways.

Outcome: The hypothesis that ETU ingested a null payload was confirmed, triggering requests for fresh pending data and instrumentation upgrades.
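
A small sketch of the step 1 comparison, assuming you have a golden Production payload handy: walk the golden JSON and report every field the CR response leaves null or absent. The sample payload shapes are illustrative, not the actual CR schema.

```python
# Sketch of the Prod-vs-CR comparison: walk the golden Production payload and
# report every field the CR response leaves null or absent. Sample payloads are
# illustrative; only enrichedMerchantDetail_v3 comes from the incident.

def diff_payloads(golden, candidate, path=""):
    """Yield dotted paths present in the golden payload but null or absent in the candidate."""
    for key, expected in golden.items():
        here = f"{path}.{key}" if path else key
        actual = candidate.get(key)
        if actual is None:
            yield here
        elif isinstance(expected, dict) and isinstance(actual, dict):
            yield from diff_payloads(expected, actual, here)

golden = {"enrichedMerchantDetail_v3": {"merchantName": "Acme", "merchantState": "OH"}}
cr_response = {"enrichedMerchantDetail_v3": None}  # what the CR environment returned

for gap in diff_payloads(golden, cr_response):
    print(f"CR payload missing or null: {gap}")
```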

Immediate Challenges

  • Pending transactions decay into posted records within days, shrinking the test window.
  • Null enrichment data masked downstream logic, giving false positives in automation suites.
  • CR environment instability made root-cause isolation slower than in Prod.

Meeting 2 — Fix the Feeds

With the enriched fields still dark, Jacob (Columbus) uncovered a failed SOD batch that halted posted enrichment, while Ajay traced an incomplete CDP payload for pending flows.

Posted Transactions

  • SOD batch crashed in UAT; no enrichment records propagated.
  • Manual rerun restored merchant metadata but dropped some historical activity, now tracked as a BLR ticket.
  • New monitoring added to confirm batch completion before QA sign-off.

Pending Transactions

  • CDP payload arrived without merchantState, failing enrichment validation.
  • Transactions with full address data enriched successfully, proving service health.
  • Action: Kwame to ship conditioned data sets with city, state, and zip.
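
A minimal sketch of that conditioning gate, assuming pending records arrive as flat dictionaries: merchantState appears in the incident, while merchantCity and merchantZip are stand-in names for the city and zip fields.

```python
# Sketch of the conditioning gate for pending test data: reject any record that
# lacks mandatory merchant geography. merchantState comes from the incident;
# merchantCity and merchantZip are assumed field names.

REQUIRED_GEOGRAPHY = ("merchantCity", "merchantState", "merchantZip")

def missing_geography(record):
    """Return the geography fields a pending record is missing (empty if conditioned)."""
    return [field for field in REQUIRED_GEOGRAPHY if not record.get(field)]

def partition_batch(records):
    """Split a CDP test-data batch into conditioned and rejected records."""
    conditioned, rejected = [], []
    for record in records:
        (rejected if missing_geography(record) else conditioned).append(record)
    return conditioned, rejected

conditioned, rejected = partition_batch([
    {"merchantCity": "Columbus", "merchantState": "OH", "merchantZip": "43004"},
    {"merchantCity": "Austin"},  # no state or zip: enrichment would rightly reject this
])
```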

Two-Track Recovery Framework

Track 1 — Batch Integrity: Monitor scheduled jobs, store completion evidence, and keep rollback scripts ready for reruns.

Track 2 — Payload Conditioning: Validate upstream schemas before QA, with automated checks for mandatory fields across pending/posting states.
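
For Track 1, a sketch of the completion check we want before QA sign-off, assuming the scheduler emits a status record with a run date, state, and row count. The field names and the row-count floor are placeholders for whatever your batch framework actually reports.

```python
from datetime import date

# Track 1 sketch: confirm the SOD batch ran today, succeeded, and touched a
# plausible volume before QA sign-off. The status-record shape and the
# row-count floor are assumptions; wire this to what your scheduler emits.

def batch_completed(status, expected_min_rows):
    """Return (ok, reason) for a start-of-day batch status record."""
    if status.get("runDate") != date.today().isoformat():
        return False, "no run recorded for today"
    if status.get("state") != "SUCCESS":
        return False, f"batch state is {status.get('state')!r}"
    if status.get("rowsProcessed", 0) < expected_min_rows:
        return False, "row count below the expected floor"
    return True, "batch complete; archive the evidence"

ok, reason = batch_completed(
    {"runDate": date.today().isoformat(), "state": "SUCCESS", "rowsProcessed": 12000},
    expected_min_rows=10_000,
)
print(ok, reason)
```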

This framework now complements our Option B architecture guardrails to keep data contracts honest.

Meeting 3 — Secure the Delivery Window

Once enrichment resumed, we shifted to program risk. Production remained stable, but CR's volatility and the September 30 deadline forced a conversation about evidence collection, triage cadence, and the looming CDP Spark migration.

Program Risks to Track

  • Evidence deficit: Without steady pending data, we cannot prove readiness for DDA Going Primary.
  • Coordination tax: Daily triage with utilities, digital channels, and BLR is now a mandatory office-hours session.
  • Latency expectations: Teams must plan around the inherent 2–3 minute enrichment lag (1 minute CDP batch + 2 minute enrichment retry); a polling sketch follows this list.
  • Capacity conflict: CDP Spark migration competes for the same engineers critical to enrichment QA.
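
On the latency point, test automation should poll rather than fail fast. A sketch, assuming a fetch_enrichment client call you would supply: it retries until the enrichment block appears or a ceiling just above the worst-case lag expires.

```python
import time

# Sketch for respecting the enrichment lag in automation: poll until the
# enrichment block appears or the timeout passes. fetch_enrichment is a
# placeholder for whatever client call your test suite actually uses.

def wait_for_enrichment(fetch_enrichment, trace_id, timeout_s=240, poll_s=15):
    """Poll until enrichedMerchantDetail_v3 is populated or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        payload = fetch_enrichment(trace_id)
        if payload and payload.get("enrichedMerchantDetail_v3"):
            return payload
        time.sleep(poll_s)
    raise TimeoutError(f"enrichment still empty for {trace_id} after {timeout_s}s")
```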

These risks mirror lessons from our dashboard strategy: make bottlenecks visible, assign owners, and time-box decisions.

Implementation Guide

Deploy this in your own sandbox or UAT environment to harden enrichment before the next release.

  1. Instrument the core fields: Extend dashboards with enrichedMerchantDetail_v3 status, merchant geography completeness, and batch job run history.
  2. Codify conditioned data requests: Publish a test-data checklist (state, city, postal code, MCC) and require CDP sign-off before QA cycles begin.
  3. Automate payload linting: Add schema validation to ingestion pipelines, similar to the automated checks outlined in our ETD testing blueprint.
  4. Schedule daily triage: Treat the utilities sync as open office hours; log every action item with timestamps and owners in Confluence.
  5. Protect historical data: After rerunning batch jobs, run reconciliation queries to confirm no records were nulled; escalate anomalies to BLR immediately.
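
For step 5, a reconciliation sketch, assuming the pre-rerun backup and the post-rerun table can be exported as lists of records keyed by a transactionId field (an assumed name): it flags anything that disappeared or lost its enrichment block.

```python
# Reconciliation sketch for step 5: compare the pre-rerun backup against the
# post-rerun table and flag records that disappeared or lost their enrichment.
# The transactionId key is an assumed name.

def reconcile(backup_rows, rerun_rows, key="transactionId"):
    """Return (missing_ids, nulled_ids) after a batch rerun."""
    rerun_by_id = {row[key]: row for row in rerun_rows}
    missing, nulled = [], []
    for row in backup_rows:
        after = rerun_by_id.get(row[key])
        if after is None:
            missing.append(row[key])
        elif row.get("enrichedMerchantDetail_v3") and not after.get("enrichedMerchantDetail_v3"):
            nulled.append(row[key])
    return missing, nulled

backup = [{"transactionId": "t1", "enrichedMerchantDetail_v3": {"merchantState": "OH"}}]
rerun = [{"transactionId": "t1", "enrichedMerchantDetail_v3": None}]
missing, nulled = reconcile(backup, rerun)
if missing or nulled:
    print(f"escalate to BLR: {len(missing)} dropped, {len(nulled)} nulled")
```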

Quick-Start Checklist

  • 🔍 Trace IDs mapped from API logs to Splunk in under 2 minutes.
  • 🧾 Daily ledger of pending vs. posted coverage with evidence screenshots.
  • 🤝 Named owners for CDP conditioning, ETU enrichment, Digital Channels validation.
  • 📦 Backup of pre-rerun transaction tables to guard against data drops.
  • 📊 Latency dashboard refreshed every 15 minutes for triage visibility.

Reflection Questions

  • Where does your current observability stack surface null enrichment payloads before customers do?
  • How are pending transaction lifecycles protected from data staleness during long triage windows?
  • What governance mechanism forces upstream teams to ship conditioned data as part of the definition of ready?
  • How will the CDP Spark migration (or equivalent platform initiatives) cannibalize enrichment QA capacity, and what is your contingency?

Atomic Notes

Observability

  • HTTP 204/206 flagged the silent failure.
  • Trace IDs kept Prod vs. CR comparisons honest.
  • Add merchant geography completeness to dashboards.

Data Integrity

  • CDP payloads must include city, state, zip.
  • Backups required before rerunning batch jobs.
  • Label DDA primary defects distinctly for recall.

Coordination

  • Daily triage sync acts as office hours.
  • Invite Digital Channels as optional attendees to avoid overload.
  • Real-time fixes require cross-team presence.

Strategic Pressure

  • September 30 readiness date is at risk without evidence.
  • Option C is deprecated; Option B remains the north star.
  • Data center migration competes for CDP bandwidth.