docs: project docs, scripts, deployment configs, and evidence
87
docs/guides/ALERTING_ONCALL_RUNBOOK.md
Normal file
@@ -0,0 +1,87 @@
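# Alerting and On-Call Runbook

Last updated: 2026-03-24

## Goals

- Standardize alert severity levels, response time targets, escalation paths, and recovery verification steps for the user management system.
- Move from "alert rules exist" to "a handling process exists that is accountable and can be reviewed in a postmortem."
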
## Current Scope

- The repository already contains Prometheus alert rules and Alertmanager routing configuration.
- The repository already includes local structural validation materials.
- The repository already includes the Alertmanager template rendering path:
  - [`deployment/alertmanager/alertmanager.yml`](/D:/project/deployment/alertmanager/alertmanager.yml)
  - [`deployment/alertmanager/alertmanager.env.example`](/D:/project/deployment/alertmanager/alertmanager.env.example)
  - [`scripts/ops/render-alertmanager-config.ps1`](/D:/project/scripts/ops/render-alertmanager-config.ps1)
- The repository already includes a strict live-delivery drill entry point:
  - [`scripts/ops/drill-alertmanager-live-delivery.ps1`](/D:/project/scripts/ops/drill-alertmanager-live-delivery.ps1)
  - The script fails closed on unresolved placeholders, `example.*` values, and placeholder secrets.
  - The script stores only redacted config output and masked recipient information in evidence artifacts.
- Still open: evidence of a real integration with an external notification channel. The template variables must be rendered with real contacts, a real SMTP/notification channel, and a real secret source.

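
The fail-closed and masking behavior described above can be sketched as follows. This is a minimal Python illustration, not the logic of `drill-alertmanager-live-delivery.ps1`: the patterns, the placeholder-secret list, and the variable names in the usage note (`ALERTMANAGER_SMTP_FROM` and so on) are assumptions made for the example.

```python
import re

# Hypothetical rules; the real script's checks are not shown in this runbook.
UNRESOLVED = re.compile(r"\$\{[A-Z0-9_]+\}")
EXAMPLE_VALUE = re.compile(r"(^|[@./])example\.", re.IGNORECASE)
PLACEHOLDER_SECRETS = {"", "changeme", "change-me", "password", "secret", "todo"}


def find_problems(env: dict[str, str], secret_keys: set[str]) -> list[str]:
    """Return reasons to refuse the drill; an empty list means it may proceed."""
    problems = []
    for key, value in sorted(env.items()):
        if UNRESOLVED.search(value):
            problems.append(f"{key}: unresolved placeholder")
        if EXAMPLE_VALUE.search(value):
            problems.append(f"{key}: example.* value")
        if key in secret_keys and value.strip().lower() in PLACEHOLDER_SECRETS:
            problems.append(f"{key}: placeholder secret")
    return problems


def mask_recipient(address: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, sep, domain = address.partition("@")
    if not sep or not local:
        return "***"
    return f"{local[0]}***@{domain}"
```

A caller would abort whenever `find_problems(...)` returns a non-empty list, and would write only `mask_recipient(...)` output into evidence files. For example, an env containing `ALERTMANAGER_SMTP_FROM=alerts@example.com` would be rejected under these assumed rules.
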
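## Severity Levels

- `critical`
  - Typical scenarios: high error rate, database connection pool exhaustion, high memory usage
  - Target response: acknowledge within 5 minutes; provide a mitigation direction within 15 minutes
- `warning`
  - Typical scenarios: high response time, high login failure rate, low cache hit rate
  - Target response: acknowledge within 15 minutes; recover or degrade within 60 minutes
- `info`
  - Typical scenarios: low online user count, abnormal request volume
  - Target response: acknowledge during working hours; feed into trend analysis
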
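
The response targets above can be encoded as data so that a breach is easy to detect. The sketch below is illustrative only: the function name, the `next_step` label (standing for "mitigation direction" on `critical` and "recover or degrade" on `warning`), and the choice to treat `info` as having no minute-based deadline are assumptions.

```python
from datetime import datetime, timedelta

# Minute targets copied from the severity list; None means no fixed minute target.
SLA_MINUTES = {
    "critical": {"acknowledge": 5, "next_step": 15},
    "warning": {"acknowledge": 15, "next_step": 60},
    "info": {"acknowledge": None, "next_step": None},
}


def breached_targets(severity, started_at, now, acknowledged=False, next_step_done=False):
    """Return the names of targets whose deadline has passed without being met."""
    done = {"acknowledge": acknowledged, "next_step": next_step_done}
    breached = []
    for name, minutes in SLA_MINUTES[severity].items():
        if minutes is None or done[name]:
            continue
        if now - started_at > timedelta(minutes=minutes):
            breached.append(name)
    return breached
```

With this shape, a `critical` alert that is still unacknowledged ten minutes after it started reports `acknowledge` as breached, while `next_step` is not yet due.
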
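## Standard Handling Procedure

1. On receiving an alert, confirm `alertname`, `severity`, `service`, the start time, and the current value.
2. Check basic health:
   - `GET /health`
   - `GET /health/ready`
   - `GET /api/v1/auth/capabilities`
3. If the login or admin console main path is involved, run:
   - `cd frontend/admin && npm.cmd run e2e:full:win`
4. Compare against the metrics to decide whether this is a transient blip, a configuration error, a release regression, or a dependency failure.
5. If it is a release regression, go straight to the rollback procedure:
   - [`ROLLBACK_RUNBOOK.md`](/D:/project/docs/guides/ROLLBACK_RUNBOOK.md)
6. After recovery, record the root cause, the impact scope, the time to recovery, and the follow-up permanent fixes.
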
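
The basic health check in step 2 could be scripted roughly as below. This is a sketch under assumptions: the runbook does not define the response contract of these endpoints, so "any 2xx status means healthy" is a guess, and the base URL in the usage note is hypothetical.

```python
import urllib.error
import urllib.request

# The three endpoints listed in step 2 of the procedure.
HEALTH_PATHS = ["/health", "/health/ready", "/api/v1/auth/capabilities"]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def check_health(base_url: str, fetch=http_status) -> dict[str, bool]:
    """Probe each path in order; a path counts as healthy only on a 2xx status."""
    results = {}
    for path in HEALTH_PATHS:
        status = fetch(base_url.rstrip("/") + path)
        results[path] = 200 <= status < 300
    return results
```

Calling `check_health("http://localhost:8080")` (hypothetical address) returns one boolean per path. The `fetch` parameter exists so the probe can be exercised without a running service.
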
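## Escalation Path

1. The first-line on-call engineer first confirms whether the alert is real and whether it affects core user paths.
2. If a `critical` alert is not mitigated within 15 minutes, escalate to the application owner and the platform owner.
3. For data consistency issues, backup restores, or cross-version rollbacks, escalate to the DBA / platform release owner.
4. When external communication is needed, the service owner issues a single, unified incident notice.
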
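
The escalation rules above reduce to a small decision function. The sketch below is only an illustration: the function name, the role strings, and the reading of "not mitigated within 15 minutes" as "15 or more minutes elapsed" are assumptions.

```python
def escalation_targets(severity, minutes_unmitigated,
                       involves_data_or_rollback=False,
                       needs_external_comms=False):
    """Return who should be involved, in the order the escalation path lists them."""
    # Rule 1: first-line on-call always triages first.
    targets = ["first-line on-call"]
    # Rule 2: a critical alert still unmitigated after 15 minutes goes to both owners.
    if severity == "critical" and minutes_unmitigated >= 15:
        targets += ["application owner", "platform owner"]
    # Rule 3: data consistency, backup restore, or cross-version rollback.
    if involves_data_or_rollback:
        targets.append("DBA / platform release owner")
    # Rule 4: external communication goes through the service owner.
    if needs_external_comms:
        targets.append("service owner (incident notice)")
    return targets
```

For instance, a `critical` alert that has been open for 20 minutes yields the on-call engineer plus both owners, while a `warning` of any age stays with first-line on-call unless one of the two flags is set.
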
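## Pre-Release Checklist

- Alert rule structural validation passes.
- Alertmanager route receivers have been replaced with real contacts and a real SMTP/notification channel.
- The Alertmanager template has been rendered, and the rendered output no longer contains unresolved `${ALERTMANAGER_*}` variables.
- The live-delivery drill has run successfully with real env injection and has produced redacted evidence.
- The latest baseline is below the thresholds, so the release does not alert the moment it ships.
- The rollback script and the backup-restore script are executable.
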
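
The "no unresolved `${ALERTMANAGER_*}` variables" item can be checked mechanically. The following is a minimal sketch rather than the repository's validation script; the function name and the sample config in the usage note are made up for illustration.

```python
import re

# Matches leftover template variables such as ${ALERTMANAGER_SMTP_FROM}.
UNRESOLVED_VAR = re.compile(r"\$\{ALERTMANAGER_[A-Za-z0-9_]*\}")


def unresolved_variables(rendered_text: str) -> list[tuple[int, str]]:
    """Return (line_number, variable) pairs for every leftover template variable."""
    hits = []
    for number, line in enumerate(rendered_text.splitlines(), start=1):
        for match in UNRESOLVED_VAR.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

A release gate would read the rendered file, call `unresolved_variables(...)`, and fail if the list is non-empty. Reporting line numbers makes it easier to find which env value was never injected.
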
## Local Validation

- Alerting package validation script:
  - [`scripts/ops/validate-alerting-package.ps1`](/D:/project/scripts/ops/validate-alerting-package.ps1)
- Alertmanager render drill script:
  - [`scripts/ops/drill-alertmanager-render.ps1`](/D:/project/scripts/ops/drill-alertmanager-render.ps1)
- Alertmanager live-delivery drill script:
  - [`scripts/ops/drill-alertmanager-live-delivery.ps1`](/D:/project/scripts/ops/drill-alertmanager-live-delivery.ps1)
  - Use a real env file or the process environment; `alertmanager.env.example` is expected to fail closed and cannot be used as closure evidence.
- Latest validation evidence:
  - Running the validation generates `docs/evidence/ops/<date>/alerting/ALERTING_PACKAGE_<timestamp>.md`
  - Running the render drill generates `docs/evidence/ops/<date>/alerting/<timestamp>/ALERTMANAGER_RENDER_DRILL.md`
  - Running the live-delivery drill generates `docs/evidence/ops/<date>/alerting/<timestamp>/ALERTMANAGER_LIVE_DELIVERY_DRILL.md`

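
The evidence locations above follow a predictable layout, which the sketch below reproduces. The `<date>` and `<timestamp>` formats are assumptions inferred from the baseline file name `LOCAL_BASELINE_20260324-090637.md` under `2026-03-24`; the scripts themselves may format them differently.

```python
from datetime import datetime
from pathlib import PurePosixPath

EVIDENCE_ROOT = PurePosixPath("docs/evidence/ops")


def evidence_paths(run_time: datetime) -> dict[str, PurePosixPath]:
    """Build the three evidence paths for a run that started at run_time."""
    date = run_time.strftime("%Y-%m-%d")          # assumed <date> format
    stamp = run_time.strftime("%Y%m%d-%H%M%S")    # assumed <timestamp> format
    base = EVIDENCE_ROOT / date / "alerting"
    return {
        "package": base / f"ALERTING_PACKAGE_{stamp}.md",
        "render_drill": base / stamp / "ALERTMANAGER_RENDER_DRILL.md",
        "live_delivery_drill": base / stamp / "ALERTMANAGER_LIVE_DELIVERY_DRILL.md",
    }
```

A helper like this is handy when a review needs to confirm that the evidence for a given run actually exists on disk.
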
## Related Materials

- Local observability baseline:
  - [`docs/evidence/ops/2026-03-24/observability/LOCAL_BASELINE_20260324-090637.md`](/D:/project/docs/evidence/ops/2026-03-24/observability/LOCAL_BASELINE_20260324-090637.md)
- Rollback runbook:
  - [`ROLLBACK_RUNBOOK.md`](/D:/project/docs/guides/ROLLBACK_RUNBOOK.md)
- Current real project status:
  - [`REAL_PROJECT_STATUS.md`](/D:/project/docs/status/REAL_PROJECT_STATUS.md)