This page defines how the 137th Advisers engineering team responds to production incidents affecting PensionsPortal.ie. It covers detection, triage, resolution, and post-incident review.

Incident Severity Levels

Level | Description | Examples | Target Response
P1 — Critical | Full service unavailable or data breach | Application down, database inaccessible, confirmed breach | 15 minutes
P2 — High | Major feature unavailable, data integrity risk | AI features failing, auth broken, bulk data error | 1 hour
P3 — Medium | Degraded performance or partial feature failure | Slow queries, email delivery failing, non-critical API errors | 4 hours
P4 — Low | Minor issue, no user impact | Cosmetic bug, low-volume error, non-production issue | Next business day

Incident Response Process

1. Detection

Incidents are detected via:
  • Sentry alerts — unhandled exceptions, error rate spikes
  • Uptime monitor — health endpoint failures
  • Cloudflare alerts — WAF block spikes, origin errors
  • User reports — broker or trustee reports via support channel
All alerts route to the on-call engineer via PagerDuty / Slack.
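The uptime monitor's health-endpoint check can be sketched as a small pure function: given the result of each dependency probe, decide what the endpoint should return. This is an illustrative sketch, not the actual PensionsPortal.ie implementation; the names `buildHealthReport` and `checks` are assumptions.

```typescript
interface HealthReport {
  status: number; // HTTP status the health endpoint should return
  body: { ok: boolean; failing: string[] };
}

// Any failing dependency turns the endpoint into a 503, which is
// what trips the uptime monitor's alert.
function buildHealthReport(checks: Record<string, boolean>): HealthReport {
  const failing = Object.entries(checks)
    .filter(([, ok]) => !ok)
    .map(([name]) => name);
  const ok = failing.length === 0;
  return { status: ok ? 200 : 503, body: { ok, failing } };
}
```

Keeping the decision logic pure makes it trivial to unit-test without standing up the database or email provider.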
2. Acknowledge and Triage

The on-call engineer acknowledges the alert within the target response time and then:
  1. Assesses the scope (which tenants are affected? how many users?)
  2. Assigns a severity level (P1–P4)
  3. Opens an incident channel in Slack (#incident-<date>)
  4. Notifies the engineering lead for P1/P2 incidents
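The triage rules above (target response per severity, lead notification for P1/P2, channel naming) can be encoded directly. The constants and function names below are illustrative, and "next business day" is approximated as 24 hours for the sketch:

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

// Target acknowledgement window from the severity table, in minutes.
// P4's "next business day" is approximated as 24h here (assumption).
const TARGET_RESPONSE_MINUTES: Record<Severity, number> = {
  P1: 15,
  P2: 60,
  P3: 240,
  P4: 1440,
};

// The engineering lead is notified for P1/P2 incidents only.
function mustNotifyLead(severity: Severity): boolean {
  return severity === "P1" || severity === "P2";
}

// Channel name per the #incident-<date> convention; ISO date is an
// assumed format, e.g. "#incident-2025-03-14".
function incidentChannel(date: Date): string {
  return `#incident-${date.toISOString().slice(0, 10)}`;
}
```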
3. Investigation

  • Check Sentry for error details and affected transactions
  • Check Vercel deployment logs for recent changes
  • Check Neon dashboard for database connectivity and query performance
  • Check Cloudflare analytics for traffic anomalies
  • Review audit logs for suspicious actor activity
4. Contain and Mitigate

Apply the fastest mitigation available:
  • Vercel rollback — for bad deployments
  • Environment variable fix — for misconfiguration
  • Cloudflare rule — for traffic-based attacks
  • Database connection restart — for connection pool exhaustion
  • Feature flag disable — for isolated feature failures
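A minimal sketch of the "feature flag disable" mitigation, assuming an environment-variable kill switch. `DISABLED_FEATURES` and `isFeatureEnabled` are hypothetical names, not the real configuration keys:

```typescript
// Hypothetical kill switch: a comma-separated env var lists features
// to disable, so an isolated failure can be turned off via an
// environment variable change (no redeploy of code required).
function isFeatureEnabled(
  feature: string,
  env: Record<string, string | undefined> = process.env,
): boolean {
  const disabled = (env.DISABLED_FEATURES ?? "")
    .split(",")
    .map((f) => f.trim())
    .filter(Boolean);
  return !disabled.includes(feature);
}
```

The appeal of this approach during an incident is speed: flipping an env var on Vercel is faster than reverting and redeploying code.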
5. Resolve

Confirm resolution by:
  • Verifying health endpoints return 200
  • Checking Sentry error rate has returned to baseline
  • Manually testing the affected user journey
  • Communicating resolution to affected users
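The checklist above can be expressed as a single predicate over the evidence gathered. The 20% tolerance for "returned to baseline" is an illustrative assumption, not a documented SLO:

```typescript
interface ResolutionEvidence {
  healthStatus: number;       // current health endpoint status
  errorRate: number;          // current Sentry error rate (errors/min)
  baselineErrorRate: number;  // pre-incident baseline (errors/min)
  journeyTestPassed: boolean; // manual test of the affected user journey
}

// All checks must pass before the incident is declared resolved.
function canDeclareResolved(e: ResolutionEvidence): boolean {
  return (
    e.healthStatus === 200 &&
    // "back to baseline": within 20% of the pre-incident rate (assumed tolerance)
    e.errorRate <= e.baselineErrorRate * 1.2 &&
    e.journeyTestPassed
  );
}
```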
6. Post-Incident Review

Within 48 hours of resolution, conduct a blameless post-mortem:
  • Timeline of events
  • Root cause analysis
  • Impact assessment (users affected, data at risk)
  • Action items to prevent recurrence
  • Update runbooks if the incident revealed gaps

Communication Templates

Internal Incident Notification (Slack)

🚨 P1 INCIDENT — [brief description]
Time detected: [HH:MM UTC]
Scope: [affected tenants/features]
On-call: [name]
Incident channel: #incident-[date]
Sentry: [link]
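If the notification is posted by tooling rather than by hand, the template can be filled programmatically. This is a sketch; the field names simply mirror the template and nothing here is from the real on-call tooling:

```typescript
interface IncidentNotice {
  severity: string;    // e.g. "P1"
  description: string; // brief description
  detectedUtc: string; // "HH:MM UTC"
  scope: string;       // affected tenants/features
  onCall: string;
  channel: string;     // "#incident-<date>"
  sentryLink: string;
}

// Renders the internal Slack notification template line by line.
function formatIncidentNotice(n: IncidentNotice): string {
  return [
    `🚨 ${n.severity} INCIDENT — ${n.description}`,
    `Time detected: ${n.detectedUtc}`,
    `Scope: ${n.scope}`,
    `On-call: ${n.onCall}`,
    `Incident channel: ${n.channel}`,
    `Sentry: ${n.sentryLink}`,
  ].join("\n");
}
```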

User-Facing Status Update (Email / Banner)

We are currently investigating an issue affecting [feature].
Our team is working to resolve this as a priority.
We will provide an update within [X] hours.

GDPR Breach Notification

If the incident involves actual or suspected exposure of personal data:
  1. Immediately notify the Data Protection Officer
  2. Assess with the DPO whether the breach is notifiable (a breach is notifiable unless it is unlikely to result in a risk to individuals' rights and freedoms)
  3. If notifiable: report to the Data Protection Commission within 72 hours
  4. If high risk to individuals: notify affected data subjects without undue delay
See the Security Incident Runbook for the full breach response checklist.
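The 72-hour clock is simple date arithmetic from the moment the team becomes aware of the breach; when "awareness" begins is a judgment call made with the DPO, not something code can decide. A minimal sketch:

```typescript
// DPC notification deadline: 72 hours from becoming aware of the
// breach (GDPR Art. 33). Pure date arithmetic in UTC.
function dpcDeadline(awareAt: Date): Date {
  return new Date(awareAt.getTime() + 72 * 60 * 60 * 1000);
}
```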

Escalation Matrix

Incident Type | Primary | Escalate To
Application error | On-call engineer | Engineering lead
Data breach | Engineering lead | CTO + DPO + Legal
Infrastructure outage | On-call engineer | Engineering lead + Vendor support
AI safety concern | Engineering lead | CTO
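The matrix above can be kept as data so tooling and humans read the same source of truth. The incident-type keys below are illustrative slugs of the table rows; this sketch is not wired into any paging system:

```typescript
// Escalation matrix as data, mirroring the table above.
const ESCALATION: Record<string, { primary: string; escalateTo: string[] }> = {
  "application-error":      { primary: "on-call engineer", escalateTo: ["engineering lead"] },
  "data-breach":            { primary: "engineering lead", escalateTo: ["CTO", "DPO", "Legal"] },
  "infrastructure-outage":  { primary: "on-call engineer", escalateTo: ["engineering lead", "vendor support"] },
  "ai-safety-concern":      { primary: "engineering lead", escalateTo: ["CTO"] },
};

// Full notification order: primary first, then escalation targets.
function escalationPath(incidentType: string): string[] {
  const entry = ESCALATION[incidentType];
  if (!entry) throw new Error(`Unknown incident type: ${incidentType}`);
  return [entry.primary, ...entry.escalateTo];
}
```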