Billing Support × Billing Engineering

Escalation Process Review

Aligning on a move from case-by-case escalations to incident-based management · Daniel Anselmo (Global Lead, Billing Customer Incident Support) & Stijn Zweegers (Senior Billing Engineering Manager)

Why we met

With new leadership over the FINTECH team, we opened a joint review of the billing escalation process — moving from a rough, individual understanding to a concrete plan for reducing unnecessary escalations and improving alignment between Customer Incident Support (CSUP) and Billing Engineering.

The core problem — repetitive escalations don't scale

~800 cases linked to a single incident (R2 / Zero Trust)
100s of duplicate escalations for one known issue
Human-dependent: escalation decisions vary by individual skill

Today, cases are escalated individually even when the workaround or root cause is already known. That consumes capacity better spent on root-cause fixes and innovation. An earlier, more efficient model (a Jira Epic plus tables of contents for bulk fixes) was discontinued for unknown reasons.

Current vs. target operating model

flowchart LR
    subgraph CUR["CURRENT — case-by-case"]
      direction TB
      A1["Customer cases"] --> A2["Individual escalation<br/>per case"]
      A2 --> A3["Billing Eng repeats<br/>an already-known fix"]
      A3 --> A4["Hundreds of duplicate<br/>escalations per issue"]
    end
    subgraph TGT["TARGET — incident-based"]
      direction TB
      B1["Customer cases"] --> B2["Confirm issue once"]
      B2 --> B3["Incident record +<br/>public status page"]
      B3 --> B4["RCA ticket: root cause,<br/>fix ETA, data cleanup, severity"]
      B4 --> B5["Customers informed →<br/>fewer new cases"]
    end
    CUR -.->|"shift focus"| TGT
    classDef cur fill:#fce8e6,stroke:#ea4335
    classDef tgt fill:#e6f4ea,stroke:#34a853
    class A1,A2,A3,A4 cur
    class B1,B2,B3,B4,B5 tgt

Once an issue is confirmed, shift from individual escalations to an incident with public status updates — keeping customers informed and preventing further case creation. CSUP effort refocuses on investigation, status, and root-cause identification.

What we agreed

Define incident criteria

A formal policy distinguishing a bug from an incident: cosmetic/visual issues do not warrant an incident; anything blocking subscriptions or payments does. Reserving incidents for customer-blocking issues keeps incident volume sustainable.
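The rule above can be sketched as a small decision function. This is an illustrative sketch only: the `Issue` fields and the `needs_triage` fallback are assumptions, since the policy is still to be drafted and nothing agreed so far covers issues that are neither cosmetic nor blocking.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # Hypothetical fields, named for illustration only.
    blocks_subscriptions: bool = False
    blocks_payments: bool = False
    cosmetic_only: bool = False


def classify(issue: Issue) -> str:
    """Apply the agreed rule: blocking issues are incidents, cosmetic ones are bugs."""
    # Anything blocking subscriptions or payments warrants an incident.
    if issue.blocks_subscriptions or issue.blocks_payments:
        return "incident"
    # Cosmetic/visual issues are handled as ordinary bugs.
    if issue.cosmetic_only:
        return "bug"
    # Assumption: the middle ground is left for a human to triage.
    return "needs_triage"
```

For example, `classify(Issue(blocks_payments=True))` returns `"incident"`, while `classify(Issue(cosmetic_only=True))` returns `"bug"`.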

Prioritize beyond escalation counts

Every root-cause bug gets a dedicated RCA ticket (code issue, fix timeline, data cleanup, severity). Prioritization weights:

Customer size · Hybrid / enterprise · Customer age · Total impact

This way, critical enterprise clients (e.g. Notion) are addressed first. Escalation count alone has proven an incomplete metric.
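One possible shape for such a framework is a weighted score per RCA ticket. The sketch below is hypothetical: the weight values, the field names, and the 0-to-1 normalization are all assumptions for illustration, not figures agreed in the review.

```python
from dataclasses import dataclass

# Hypothetical weights; the real values are still to be defined.
WEIGHTS = {"size": 0.3, "segment": 0.3, "age": 0.1, "impact": 0.3}


@dataclass
class RcaTicket:
    key: str
    customer_size: float            # assumed normalized to 0..1
    is_hybrid_or_enterprise: bool   # segment flag
    customer_age: float             # assumed normalized to 0..1
    total_impact: float             # assumed normalized to 0..1


def priority_score(t: RcaTicket) -> float:
    """Weighted sum of the four factors listed above."""
    segment = 1.0 if t.is_hybrid_or_enterprise else 0.0
    return (
        WEIGHTS["size"] * t.customer_size
        + WEIGHTS["segment"] * segment
        + WEIGHTS["age"] * t.customer_age
        + WEIGHTS["impact"] * t.total_impact
    )


def rank(tickets: list[RcaTicket]) -> list[RcaTicket]:
    """Return tickets ordered from highest to lowest priority."""
    return sorted(tickets, key=priority_score, reverse=True)


tickets = [
    RcaTicket("RCA-B", 0.2, False, 0.5, 0.6),  # small customer, higher raw impact
    RcaTicket("RCA-A", 0.9, True, 0.5, 0.4),   # large enterprise customer
]
ranked = rank(tickets)
```

With these assumed weights, the enterprise ticket `RCA-A` ranks ahead of `RCA-B` even though its raw impact is lower, which is the behavior that an escalation count alone would not capture.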

Controlled tooling for CSUP

A safe, controlled UI for the team to action accounts (e.g. clearing billing flags) without exposing high-risk commands. Enables autonomous resolution, reduces engineering load, and speeds resolution times.

Close tooling & training gaps

CSUP currently lacks visibility into the subscription lifecycle, Ninja Panel, and Stripe, so agents have to hunt down individuals for answers. Documentation, training, and system access are needed to operate as a true frontline engineering team.

Engineering initiatives in progress

Anomaly detection (Goa): smarter, dynamic thresholds for incident creation — moving away from fixed triggers like "5 cases/day".
Data classification (Town): classifying Salesforce cases to understand root causes and compare support data against technical billing data — a path to precise, eventually automated incident creation, with care to avoid false positives that overwhelm engineering.
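To illustrate the difference between a fixed trigger and a dynamic threshold, here is a minimal sketch that derives the threshold from recent daily case counts (mean plus `k` standard deviations). It is not a description of how Goa works: the function names, the `k` multiplier, and the `floor` parameter are assumptions made for this example.

```python
from statistics import mean, pstdev


def dynamic_threshold(history: list[int], k: float = 3.0, floor: float = 5.0) -> float:
    """Threshold from a trailing window of daily case counts.

    `floor` (an assumption) keeps quiet issue types from triggering
    incidents on very small counts, to limit false positives.
    """
    return max(floor, mean(history) + k * pstdev(history))


def is_anomalous(today: int, history: list[int], k: float = 3.0, floor: float = 5.0) -> bool:
    """True when today's case count exceeds the dynamic threshold."""
    return today > dynamic_threshold(history, k, floor)


# A category that normally sees about 20 cases/day.
busy = [20, 22, 19, 21, 20, 23, 18]
# A category that is normally almost silent.
quiet = [0, 1, 0, 0, 1, 0, 0]
```

Under these assumptions, 24 cases in the busy category is not flagged, although a fixed "5 cases/day" trigger would fire every day, while 40 cases is flagged. In the quiet category, the floor keeps 3 cases from opening an incident.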

Next steps & ownership

Action	Owner
Refine the escalation → incident transition	CSUP + Eng (w/ Leslie)
Draft the bug-vs-incident policy	Joint
Build prioritization framework (size, segment, age, impact)	Billing Eng
Continue Town classification & Goa anomaly detection	Billing Eng
Scope a controlled CSUP account-actioning UI	Joint
Address CSUP tooling, documentation & training gaps	CSUP

The outcome we're targeting

Fewer duplicate escalations, faster resolution, customers proactively informed, and engineering freed to fix root causes — with CSUP operating as an empowered frontline team rather than a relay for repetitive escalations.