This page defines how the 137th Advisers engineering team responds to production incidents affecting PensionsPortal.ie. It covers detection, triage, resolution, and post-incident review.

Incident Severity Levels

Level | Description | Examples | Target Response
P1 — Critical | Full service unavailable or data breach | Application down, database inaccessible, confirmed breach | 15 minutes
P2 — High | Major feature unavailable, data integrity risk | AI features failing, auth broken, bulk data error | 1 hour
P3 — Medium | Degraded performance or partial feature failure | Slow queries, email delivery failing, non-critical API errors | 4 hours
P4 — Low | Minor issue, no user impact | Cosmetic bug, low-volume error, non-production issue | Next business day

Incident Response Process

1. Detection

Incidents are detected via:
  • Sentry alerts — unhandled exceptions, error rate spikes
  • Uptime monitor — health endpoint failures
  • Cloudflare alerts — WAF block spikes, origin errors
  • User reports — broker or trustee reports via support channel
All alerts route to the on-call engineer via PagerDuty / Slack.
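The uptime monitor's health-endpoint check can be sketched as a small pure function: given the result of each dependency probe, decide what the endpoint should return. This is an illustrative sketch, not the actual PensionsPortal.ie implementation; the names `buildHealthReport` and `checks` are assumptions.

```typescript
interface HealthReport {
  status: number; // HTTP status the health endpoint should return
  body: { ok: boolean; failing: string[] };
}

// Any failing dependency turns the endpoint into a 503, which is
// what trips the uptime monitor's alert.
function buildHealthReport(checks: Record<string, boolean>): HealthReport {
  const failing = Object.entries(checks)
    .filter(([, ok]) => !ok)
    .map(([name]) => name);
  const ok = failing.length === 0;
  return { status: ok ? 200 : 503, body: { ok, failing } };
}
```

Keeping the decision logic pure makes it trivial to unit-test without standing up the database or email provider.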
2. Acknowledge and Triage

The on-call engineer acknowledges the alert within the target response time and then:
  1. Assesses the scope (which tenants are affected? how many users?)
  2. Assigns a severity level (P1–P4)
  3. Opens an incident channel in Slack (#incident-<date>)
  4. Notifies the engineering lead for P1/P2 incidents
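The triage rules above (target response per severity, lead notification for P1/P2, channel naming) can be encoded directly. The constants and function names below are illustrative, and "next business day" is approximated as 24 hours for the sketch:

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

// Target acknowledgement window from the severity table, in minutes.
// P4's "next business day" is approximated as 24h here (assumption).
const TARGET_RESPONSE_MINUTES: Record<Severity, number> = {
  P1: 15,
  P2: 60,
  P3: 240,
  P4: 1440,
};

// The engineering lead is notified for P1/P2 incidents only.
function mustNotifyLead(severity: Severity): boolean {
  return severity === "P1" || severity === "P2";
}

// Channel name per the #incident-<date> convention; ISO date is an
// assumed format, e.g. "#incident-2025-03-14".
function incidentChannel(date: Date): string {
  return `#incident-${date.toISOString().slice(0, 10)}`;
}
```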
3. Investigation

  • Check Sentry for error details and affected transactions
  • Check Vercel deployment logs for recent changes
  • Check Neon dashboard for database connectivity and query performance
  • Check Cloudflare analytics for traffic anomalies
  • Review audit logs for suspicious actor activity
4. Contain and Mitigate

Apply the fastest mitigation available:
  • Vercel rollback — for bad deployments
  • Environment variable fix — for misconfiguration
  • Cloudflare rule — for traffic-based attacks
  • Database connection restart — for connection pool exhaustion
  • Feature flag disable — for isolated feature failures
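A minimal sketch of the "feature flag disable" mitigation, assuming an environment-variable kill switch. `DISABLED_FEATURES` and `isFeatureEnabled` are hypothetical names, not the real configuration keys:

```typescript
// Hypothetical kill switch: a comma-separated env var lists features
// to disable, so an isolated failure can be turned off via an
// environment variable change (no redeploy of code required).
function isFeatureEnabled(
  feature: string,
  env: Record<string, string | undefined> = process.env,
): boolean {
  const disabled = (env.DISABLED_FEATURES ?? "")
    .split(",")
    .map((f) => f.trim())
    .filter(Boolean);
  return !disabled.includes(feature);
}
```

The appeal of this approach during an incident is speed: flipping an env var on Vercel is faster than reverting and redeploying code.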
5. Resolve

Confirm resolution by:
  • Verifying health endpoints return 200
  • Checking Sentry error rate has returned to baseline
  • Manually testing the affected user journey
  • Communicating resolution to affected users
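The checklist above can be expressed as a single predicate over the evidence gathered. The 20% tolerance for "returned to baseline" is an illustrative assumption, not a documented SLO:

```typescript
interface ResolutionEvidence {
  healthStatus: number;       // current health endpoint status
  errorRate: number;          // current Sentry error rate (errors/min)
  baselineErrorRate: number;  // pre-incident baseline (errors/min)
  journeyTestPassed: boolean; // manual test of the affected user journey
}

// All checks must pass before the incident is declared resolved.
function canDeclareResolved(e: ResolutionEvidence): boolean {
  return (
    e.healthStatus === 200 &&
    // "back to baseline": within 20% of the pre-incident rate (assumed tolerance)
    e.errorRate <= e.baselineErrorRate * 1.2 &&
    e.journeyTestPassed
  );
}
```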
6. Post-Incident Review

Within 48 hours of resolution, conduct a blameless post-mortem:
  • Timeline of events
  • Root cause analysis
  • Impact assessment (users affected, data at risk)
  • Action items to prevent recurrence
  • Update runbooks if the incident revealed gaps

Communication Templates

Internal Incident Notification (Slack)

🚨 P1 INCIDENT — [brief description]
Time detected: [HH:MM UTC]
Scope: [affected tenants/features]
On-call: [name]
Incident channel: #incident-[date]
Sentry: [link]
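If the notification is posted by tooling rather than by hand, the template can be filled programmatically. This is a sketch; the field names simply mirror the template and nothing here is from the real on-call tooling:

```typescript
interface IncidentNotice {
  severity: string;    // e.g. "P1"
  description: string; // brief description
  detectedUtc: string; // "HH:MM UTC"
  scope: string;       // affected tenants/features
  onCall: string;
  channel: string;     // "#incident-<date>"
  sentryLink: string;
}

// Renders the internal Slack notification template line by line.
function formatIncidentNotice(n: IncidentNotice): string {
  return [
    `🚨 ${n.severity} INCIDENT — ${n.description}`,
    `Time detected: ${n.detectedUtc}`,
    `Scope: ${n.scope}`,
    `On-call: ${n.onCall}`,
    `Incident channel: ${n.channel}`,
    `Sentry: ${n.sentryLink}`,
  ].join("\n");
}
```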

User-Facing Status Update (Email / Banner)

We are currently investigating an issue affecting [feature].
Our team is working to resolve this as a priority.
We will provide an update within [X] hours.

GDPR Breach Notification

If the incident involves actual or suspected exposure of personal data:
  1. Immediately notify the Data Protection Officer
  2. Assess with the DPO whether the breach is notifiable (a breach is notifiable unless it is unlikely to result in a risk to individuals' rights and freedoms)
  3. If notifiable: report to the Data Protection Commission within 72 hours
  4. If high risk to individuals: notify affected data subjects without undue delay
See the Security Incident Runbook for the full breach response checklist.
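The 72-hour clock is simple date arithmetic from the moment the team becomes aware of the breach; when "awareness" begins is a judgment call made with the DPO, not something code can decide. A minimal sketch:

```typescript
// DPC notification deadline: 72 hours from becoming aware of the
// breach (GDPR Art. 33). Pure date arithmetic in UTC.
function dpcDeadline(awareAt: Date): Date {
  return new Date(awareAt.getTime() + 72 * 60 * 60 * 1000);
}
```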

Escalation Matrix

Incident Type | Primary | Escalate To
Application error | On-call engineer | Engineering lead
Data breach | Engineering lead | CTO + DPO + Legal
Infrastructure outage | On-call engineer | Engineering lead + Vendor support
AI safety concern | Engineering lead | CTO
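The matrix above can be kept as data so tooling and humans read the same source of truth. The incident-type keys below are illustrative slugs of the table rows; this sketch is not wired into any paging system:

```typescript
// Escalation matrix as data, mirroring the table above.
const ESCALATION: Record<string, { primary: string; escalateTo: string[] }> = {
  "application-error":      { primary: "on-call engineer", escalateTo: ["engineering lead"] },
  "data-breach":            { primary: "engineering lead", escalateTo: ["CTO", "DPO", "Legal"] },
  "infrastructure-outage":  { primary: "on-call engineer", escalateTo: ["engineering lead", "vendor support"] },
  "ai-safety-concern":      { primary: "engineering lead", escalateTo: ["CTO"] },
};

// Full notification order: primary first, then escalation targets.
function escalationPath(incidentType: string): string[] {
  const entry = ESCALATION[incidentType];
  if (!entry) throw new Error(`Unknown incident type: ${incidentType}`);
  return [entry.primary, ...entry.escalateTo];
}
```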