Incident Severity Levels
| Level | Description | Examples | Target Response |
|---|---|---|---|
| P1 — Critical | Full service unavailable or data breach | Application down, database inaccessible, confirmed breach | 15 minutes |
| P2 — High | Major feature unavailable, data integrity risk | AI features failing, auth broken, bulk data error | 1 hour |
| P3 — Medium | Degraded performance or partial feature failure | Slow queries, email delivery failing, non-critical API errors | 4 hours |
| P4 — Low | Minor issue, no user impact | Cosmetic bug, low-volume error, non-production issue | Next business day |
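The response targets in the table can be encoded so that tooling computes an acknowledgement deadline for each alert. The following is a minimal sketch, not an existing implementation: the names `RESPONSE_TARGETS`, `next_business_day`, and `ack_deadline` are hypothetical, business days are assumed to be Monday to Friday with public holidays ignored, and "next business day" is assumed to mean by the end of that day.

```python
from datetime import date, datetime, time, timedelta

# Target response per severity level, taken from the table above.
# P4 has no fixed duration: it is due on the next business day.
RESPONSE_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": None,
}


def next_business_day(day: date) -> date:
    """Return the first Monday-to-Friday date after `day` (holidays ignored)."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def ack_deadline(level: str, detected_at: datetime) -> datetime:
    """Return the latest time by which the incident should be acknowledged."""
    target = RESPONSE_TARGETS[level]
    if target is not None:
        return detected_at + target
    # Assumption: a P4 is due by the end of the next business day.
    due_day = next_business_day(detected_at.date())
    return datetime.combine(due_day, time.max, tzinfo=detected_at.tzinfo)
```

For example, a P1 detected at 10:00 is due by 10:15, while a P4 detected on a Friday is due by the end of the following Monday.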
Incident Response Process
Detection
Incidents are detected via:
- Sentry alerts — unhandled exceptions, error rate spikes
- Uptime monitor — health endpoint failures
- Cloudflare alerts — WAF block spikes, origin errors
- User reports — broker or trustee reports via support channel
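Sentry, the uptime monitor, and Cloudflare each apply their own alert rules, but the idea behind an "error rate spike" alert can be sketched independently of any vendor. The function below is illustrative only: `is_error_spike`, the 3x multiplier, and the minimum-volume guard are assumptions, not settings taken from this runbook.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def is_error_spike(
    errors: int,
    requests: int,
    baseline_rate: float,
    multiplier: float = 3.0,  # assumed threshold: 3x the baseline rate
    min_requests: int = 50,   # assumed guard against low-traffic noise
) -> bool:
    """Return True when the window's error rate exceeds baseline * multiplier."""
    if requests < min_requests:
        # Too little traffic to judge: a single failure would dominate the rate.
        return False
    return error_rate(errors, requests) > baseline_rate * multiplier
```

The volume guard is a design choice: without it, one failed request in a quiet window would page the on-call engineer.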
Acknowledge and Triage
The on-call engineer acknowledges the alert within the target response time and:
- Assesses the scope (which tenants affected? how many users?)
- Assigns a severity level (P1–P4)
- Opens an incident channel in Slack (`#incident-<date>`)
- Notifies the engineering lead for P1/P2 incidents
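The triage steps can be expressed as a few small functions. This is a sketch under stated assumptions: the `Scope` fields are hypothetical names for the facts gathered during triage, the severity rules paraphrase the severity table, and the `<date>` part of the channel name is assumed to use ISO 8601 format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Scope:
    """Facts gathered during triage (field names are illustrative)."""
    full_outage: bool = False
    confirmed_breach: bool = False
    major_feature_down: bool = False
    data_integrity_risk: bool = False
    degraded: bool = False  # slow or partially failing, with user impact


def assign_severity(scope: Scope) -> str:
    """Map triage facts to a level, checking the most severe conditions first."""
    if scope.full_outage or scope.confirmed_breach:
        return "P1"
    if scope.major_feature_down or scope.data_integrity_risk:
        return "P2"
    if scope.degraded:
        return "P3"
    return "P4"


def incident_channel(day: date) -> str:
    # Assumption: <date> is rendered as YYYY-MM-DD.
    return f"#incident-{day.isoformat()}"


def must_notify_engineering_lead(severity: str) -> bool:
    """The engineering lead is notified for P1 and P2 incidents only."""
    return severity in {"P1", "P2"}
```

Checking conditions from most to least severe means an incident that matches several rows of the table receives the highest applicable level.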
Investigation
- Check Sentry for error details and affected transactions
- Check Vercel deployment logs for recent changes
- Check Neon dashboard for database connectivity and query performance
- Check Cloudflare analytics for traffic anomalies
- Review audit logs for suspicious actor activity
Contain and Mitigate
Apply the fastest mitigation available:
- Vercel rollback — for bad deployments
- Environment variable fix — for misconfiguration
- Cloudflare rule — for traffic-based attacks
- Database connection restart — for connection pool exhaustion
- Feature flag disable — for isolated feature failures
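The pairing of cause and mitigation above can be kept as a simple lookup so that incident tooling suggests the first action to try. This is only a sketch: the cause keys and the `suggest_mitigation` helper are hypothetical, and the values are the mitigations listed above.

```python
from typing import Optional

# Cause (hypothetical key) -> fastest mitigation, as listed in this section.
MITIGATIONS = {
    "bad_deployment": "Vercel rollback",
    "misconfiguration": "Environment variable fix",
    "traffic_attack": "Cloudflare rule",
    "connection_pool_exhaustion": "Database connection restart",
    "isolated_feature_failure": "Feature flag disable",
}


def suggest_mitigation(cause: str) -> Optional[str]:
    """Return the listed mitigation for a cause, or None if none is listed."""
    return MITIGATIONS.get(cause)
```

A `None` result means the cause is not covered by the list and investigation should continue.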
Resolve
Confirm resolution by:
- Verifying health endpoints return `200`
- Checking Sentry error rate has returned to baseline
- Manually testing the affected user journey
- Communicating resolution to affected users
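The resolution checks can be combined into a single gate that lists whatever still blocks closing the incident. The sketch below assumes the inputs have already been collected by hand or by other tooling; `resolution_blockers` and its parameter names are hypothetical, and "returned to baseline" is interpreted as "not above the baseline rate".

```python
def resolution_blockers(
    health_statuses: dict[str, int],
    current_error_rate: float,
    baseline_error_rate: float,
    journey_tested: bool,
    users_notified: bool,
) -> list[str]:
    """Return the reasons the incident cannot be closed yet (empty if none)."""
    blockers = []
    for endpoint, status in health_statuses.items():
        if status != 200:
            blockers.append(f"health endpoint {endpoint} returned {status}")
    # Assumption: "returned to baseline" means not above the baseline rate.
    if current_error_rate > baseline_error_rate:
        blockers.append("error rate is still above baseline")
    if not journey_tested:
        blockers.append("affected user journey has not been manually tested")
    if not users_notified:
        blockers.append("resolution has not been communicated to affected users")
    return blockers
```

An empty list means every check passed. For instance, calling it with `{"/api/health": 503}` (a hypothetical endpoint path) reports the failing endpoint together with any other outstanding items.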
Communication Templates
Internal Incident Notification (Slack)
User-Facing Status Update (Email / Banner)
GDPR Breach Notification
If the incident involves actual or suspected exposure of personal data:
- Immediately notify the Data Protection Officer
- Assess whether the breach is notifiable (likely to result in risk to individuals)
- If notifiable: report to the Data Protection Commission within 72 hours of becoming aware of the breach
- If high risk to individuals: notify affected data subjects without undue delay
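The decision steps above can be sketched as a function that turns a risk assessment into the required notifications and the regulator deadline. The `Risk` enum and `breach_actions` are hypothetical names; the assessment itself remains a judgment for the Data Protection Officer, and the 72-hour window is counted here from the moment the organisation became aware of the breach.

```python
from datetime import datetime, timedelta
from enum import Enum


class Risk(Enum):
    """Assessed risk to individuals (illustrative three-way split)."""
    UNLIKELY = "unlikely"
    RISK = "risk"
    HIGH = "high"


REGULATOR_DEADLINE = timedelta(hours=72)


def breach_actions(risk: Risk, aware_at: datetime) -> dict:
    """Return the notifications required for an assessed personal data breach."""
    notifiable = risk in (Risk.RISK, Risk.HIGH)
    return {
        # The DPO is always notified first, whatever the assessed risk.
        "notify_dpo": True,
        # Deadline for the Data Protection Commission report, if notifiable.
        "report_to_dpc_by": aware_at + REGULATOR_DEADLINE if notifiable else None,
        # Data subjects are notified only when the risk is high.
        "notify_data_subjects": risk is Risk.HIGH,
    }
```

For a breach discovered at 09:00 on 1 March and assessed as a risk, the report is due by 09:00 on 4 March.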
Escalation Matrix
| Incident Type | Primary | Escalate To |
|---|---|---|
| Application error | On-call engineer | Engineering lead |
| Data breach | Engineering lead | CTO + DPO + Legal |
| Infrastructure outage | On-call engineer | Engineering lead + Vendor support |
| AI safety concern | Engineering lead | CTO |
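If paging or chat tooling needs to resolve who to contact, the matrix can be stored as data. A minimal sketch, assuming hypothetical incident-type keys and a hypothetical `escalation_path` helper; the roles are copied from the table above.

```python
# Incident type (hypothetical key) -> (primary owner, escalation targets).
ESCALATION_MATRIX = {
    "application_error": ("On-call engineer", ["Engineering lead"]),
    "data_breach": ("Engineering lead", ["CTO", "DPO", "Legal"]),
    "infrastructure_outage": (
        "On-call engineer",
        ["Engineering lead", "Vendor support"],
    ),
    "ai_safety_concern": ("Engineering lead", ["CTO"]),
}


def escalation_path(incident_type: str) -> tuple[str, list[str]]:
    """Return (primary, escalate_to) for a known incident type."""
    try:
        return ESCALATION_MATRIX[incident_type]
    except KeyError:
        # Fail loudly so an unrecognised type is never silently left unowned.
        raise ValueError(f"Unknown incident type: {incident_type!r}") from None
```

Raising on an unknown type, rather than returning a default, is a deliberate choice so that gaps in the matrix surface immediately.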