# UMS Site Reliability Engineering (SRE) Comprehensive Solution

> Version: v1.0 | Date: 2026-04-05 | Reviewer: SRE Engineer

---

## Executive Summary

This report is a full SRE review of the User Management System (UMS) across six dimensions: **reliability baseline, observability maturity, alerting, chaos engineering capability, capacity planning, and operations automation**.

**Current overall reliability rating: ⚠️ 4.5/10 (ready for development, not ready for production)**

| Dimension | Current | Target | Priority |
|------|--------|--------|--------|
| SLO definition | 0/10 | 8/10 | 🔴 P0 |
| Observability maturity | 3/10 | 8/10 | 🔴 P0 |
| Alerting | 4/10 | 8/10 | 🔴 P0 |
| Error budget management | 0/10 | 7/10 | 🔴 P0 |
| Chaos engineering | 1/10 | 6/10 | 🟡 P1 |
| Capacity planning | 2/10 | 7/10 | 🟡 P1 |
| Operations automation | 3/10 | 8/10 | 🟡 P1 |

---

## 1. Review of the Current System Architecture

### 1.1 Architecture Topology

```
┌─────────────────────────────────────────────────┐
│                 Frontend layer                  │
│      React 18 + TypeScript + Ant Design 5       │
│              (Vite build, no SSR)               │
└──────────────────────┬──────────────────────────┘
                       │ HTTP/REST
┌──────────────────────▼──────────────────────────┐
│                    API layer                    │
│           Gin HTTP Server (port 8080)           │
│  • Auth middleware       • Rate-limit middleware│
│  • IP-filter middleware  • Audit-log middleware │
└──────────┬──────────────────────┬───────────────┘
           │                      │
┌──────────▼────────┐   ┌─────────▼──────────────┐
│ Service layer     │   │ Cache layer            │
│ • AuthService     │   │ L1: in-memory LRU      │
│ • UserService     │   │     (10000 entries)    │
│ • DeviceService   │   │ L2: Redis (optional,   │
│ • AnomalyDetector │   │     not enabled)       │
└──────────┬────────┘   └────────────────────────┘
           │
┌──────────▼────────────────────────────────────┐
│                  Data layer                   │
│ SQLite (current runtime; production must      │
│ migrate to PostgreSQL)                        │
│ GORM ORM                                      │
└───────────────────────────────────────────────┘
```

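### 1.2 Existing Reliability Capabilities (Strengths)

| Capability | Status |
|------|------|
| Health check endpoints | ✅ `/health`, `/health/live`, `/health/ready` |
| Prometheus metrics | ✅ Defined in metrics.go, but **not exposed through any route** |
| Alertmanager configuration | ✅ Alert rule files exist, but they depend on placeholders |
| Grafana dashboards | ✅ JSON files exist |
| Graceful shutdown | ✅ 15s timeout, plus a dedicated 5s for Webhook |
| Rate limiting | ✅ Three tiers: login / registration / API |
| Anomaly detection | ✅ AnomalyDetector is wired in |
| Token rotation | ✅ Rolling Refresh Token rotation |
| Operation logs | ✅ Middleware-level audit logging |
| Database backup drill | ✅ Script exists |

### 1.3 Critical Reliability Issues (Weaknesses)

The issues are itemized, with impact and fix priority, in Section 2.

---
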
## 2. Critical Issue Checklist

### 🔴 CRIT-01: Prometheus metrics endpoint is not registered on the router

**Problem:** `metrics.go` defines a complete set of Prometheus metrics, but neither `main.go` nor `router.go` **registers a `/metrics` endpoint**. The monitoring system therefore cannot scrape any data.

```go
// Missing from main.go:
// engine.GET("/metrics", gin.WrapH(promhttp.HandlerFor(registry, promhttp.HandlerOpts{})))
// Today /health only returns {"status":"ok"}; nothing is served in Prometheus format
```

**Impact:** The Alertmanager rules have nothing to evaluate and the Grafana dashboards show no data, so all metric-based monitoring and alerting is ineffective.

**Fix priority:** P0, must be fixed immediately

---

### 🔴 CRIT-02: PrometheusMiddleware is not mounted on the router

**Problem:** `monitoring/middleware.go` defines `PrometheusMiddleware`, but the `Setup()` method in `router.go` **never calls it**, so the HTTP request count and latency metrics stay at zero.

**Impact:** The three core alerts `HighErrorRate`, `HighResponseTime`, and `UnusualAPIRequestRate` can never fire.

**Fix priority:** P0

---

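### 🔴 CRIT-03: No SLOs at all

**Problem:** The system defines no SLOs (Service Level Objectives). Without SLOs:
- Nobody knows what error rate counts as "acceptable"
- No error budget can be computed, so nothing guides release decisions
- Alert thresholds have no business basis (the current 5% error-rate threshold is an arbitrary guess)

**Impact:** The whole reliability engineering effort is missing its foundation.

**Fix priority:** P0

---

### 🔴 CRIT-04: Email-only alerting, no on-call escalation path

**Problem:** `alertmanager.yml` configures only `email_configs`, and every recipient address is a placeholder such as `${ALERTMANAGER_CRITICAL_TO}`. In production this means:
- No instant notification channel (DingTalk / Feishu / PagerDuty / WeCom)
- No on-call rotation
- Critical and Warning alerts both go out by email, with no differentiated response

**Impact:** If the system goes down at 3 a.m., the on-call engineer cannot be woken in time.

**Fix priority:** P0

---

### 🔴 CRIT-05: SQLite as the runtime database (single point of failure)

**Problem:** `config.yaml` currently selects SQLite, which means:
- No primary/replica replication and no read/write splitting
- Writes are serialized (concurrency is limited even in WAL mode)
- No horizontal scaling
- A single file is a single point of failure

**Impact:** Any disk failure or process crash results in complete unavailability (SPOF).

**Fix priority:** P0 (migration to PostgreSQL is mandatory before going to production)

---
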
### 🟡 WARN-01: L1 Cache `updateAccessOrder` is O(n)

**Problem:** The `updateAccessOrder` method in `l1.go` uses a linear scan, so it is O(n). When the cache approaches 10000 entries, every cache read can trigger a worst-case traversal of 10000 elements.

```go
// Current implementation: O(n) linear scan
func (c *L1Cache) updateAccessOrder(key string) {
	for i, k := range c.accessOrder { // worst case: 10000 iterations
		if k == key { ... }
	}
}
```

**Impact:** Under high concurrency the cache layer becomes a performance bottleneck and P99 latency rises noticeably.

**Fix priority:** P1. Switch to a `container/list` doubly linked list plus a map to get an O(1) LRU.

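
One way to realize the recommended WARN-01 fix is sketched below. It is a minimal illustration, not the project's `L1Cache`: the names `LRU`, `Get`, `Set`, and `Len` are hypothetical, TTL handling is left out, and the sketch is not safe for concurrent use, so a real cache would still need a mutex around it. The point is that `MoveToFront` on a `container/list` element replaces the linear scan, which makes both reads and writes O(1).

```go
package main

import "container/list"

// entry is the payload stored in each list element.
type entry struct {
	key   string
	value any
}

// LRU is a fixed-capacity least-recently-used cache (hypothetical sketch).
type LRU struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> its element in order
}

func NewLRU(capacity int) *LRU {
	return &LRU{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the value and marks the key as most recently used in O(1).
func (c *LRU) Get(key string) (any, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Set inserts or updates a key, evicting the least recently used entry
// when the cache grows past its capacity.
func (c *LRU) Set(key string, value any) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Len reports the number of cached entries.
func (c *LRU) Len() int { return c.order.Len() }
```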

---

### 🟡 WARN-02: Health check does not verify the Redis connection

**Problem:** The `Check()` method in `health.go` checks only the database, not the Redis connection (relevant once the L2 cache is enabled). A Redis failure degrades caching while the health check still reports UP.

**Fix priority:** P1

---

### 🟡 WARN-03: Webhook service `Enabled` is hardcoded to false

**Problem:** In `main.go`:
```go
webhookService := service.NewWebhookService(db.DB, service.WebhookServiceConfig{
	Enabled: false, // ← hardcoded! webhook.enabled=true in config.yaml is ignored
})
```
**Impact:** The Webhook feature is in fact completely disabled, which contradicts the configuration file.

**Fix priority:** P1

---

### 🟡 WARN-04: No distributed tracing

**Problem:** `config.yaml` sets `monitoring.tracing.enabled: false`, so the system has no tracing capability at all. When a request passes through several services, its path cannot be followed.

**Impact:** Mean time to recovery (MTTR) rises sharply when troubleshooting problems that span services.

**Fix priority:** P1

---

### 🟡 WARN-05: Structured logging is only partially implemented

**Problem:** `config.yaml` specifies JSON-format logs, but the code makes heavy use of `log.Printf` (Go standard library), which carries no context fields such as trace_id, request_id, or user_id.

**Impact:** Logs cannot be aggregated or queried effectively, which makes troubleshooting hard.

**Fix priority:** P1

---

### 🟢 INFO-01: Unbounded growth of the rate-limiter map (legacy)

**Problem:** An earlier code review noted a risk that the rate limiter's map grows without bound. Whether the current implementation has fixed this still needs to be confirmed.


---

## 3. SLO Definitions and Error Budget

### 3.1 SLO Framework

```yaml
# ums-slo.yaml - service level objectives for the user management system
service: user-management-system
owner: platform-team
review_cycle: 30d

slos:
  # SLO-1: API availability
  - name: api-availability
    description: "Share of valid HTTP requests that return a non-5xx response"
    sli:
      metric: |
        (
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        )
    target: 99.9%  # allows about 43.2 minutes of unavailability per 30-day window
    window: 30d
    error_budget_minutes: 43.2  # 0.1% of 30 days (43,200 minutes)
    burn_rate_alerts:
      - name: fast-burn-critical
        severity: critical
        short_window: 5m
        long_window: 1h
        burn_rate_factor: 14.4  # consumes 2% of the error budget in 1 hour
        page: true
      - name: slow-burn-warning
        severity: warning
        short_window: 30m
        long_window: 6h
        burn_rate_factor: 6  # consumes 5% of the error budget in 6 hours
        page: false

  # SLO-2: API response latency
  - name: api-latency
    description: "Share of requests completed within 500ms (equivalent to P99 < 500ms at the 99% target)"
    sli:
      metric: |
        (
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))
        )
    target: 99%
    window: 30d
    critical_paths:
      - path: "/api/v1/auth/login"
        target: 99.5%
        latency_p99: 300ms
      - path: "/api/v1/auth/refresh"
        target: 99.9%
        latency_p99: 100ms
    burn_rate_alerts:
      - name: latency-fast-burn
        severity: warning
        short_window: 5m
        long_window: 1h
        burn_rate_factor: 14.4

  # SLO-3: login success rate
  - name: login-success-rate
    description: "Share of login requests that succeed (no system error)"
    sli:
      metric: |
        (
          sum(rate(user_logins_total{status="success"}[5m]))
          /
          sum(rate(user_logins_total[5m]))
        )
    target: 99%
    window: 30d
    notes: "Legitimate failures caused by brute-force attempts do not count as SLO violations"

  # SLO-4: database query latency
  - name: db-query-latency
    description: "Share of time that P95 database query latency is under 100ms"
    sli:
      metric: |
        histogram_quantile(0.95,
          sum(rate(db_query_duration_seconds_bucket[5m])) by (le, operation)
        ) < 0.1
    target: 95%
    window: 30d
```

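### 3.2 Error Budget Policy

```
┌────────────────────────────────────────────────────────────────────────┐
│                    Error budget consumption policy                     │
├────────────────────────────────────────────────────────────────────────┤
│ Budget remaining > 50%:  release normally, iterate quickly             │
│ Budget remaining 25-50%: review the risk of each release, test more    │
│ Budget remaining 10-25%: freeze non-critical releases, fix reliability │
│ Budget remaining < 10%:  reliability fixes only, start a postmortem    │
│ Budget exhausted:        stop all feature releases until next cycle    │
└────────────────────────────────────────────────────────────────────────┘
```
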
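
The policy tiers can be encoded so that a release pipeline can query them. The following is only a sketch: `PolicyFor`, `Remaining`, and the tier names are hypothetical, and because the policy leaves the exact boundaries open, the code assumes each lower bound is inclusive (exactly 25% remaining still counts as the 25-50% tier). `Remaining` derives the remaining budget fraction from request counts: with a 99.9% target and 1,000,000 requests about 1,000 failures are allowed, so 800 failures leave roughly 20% of the budget.

```go
package main

// ReleasePolicy is the release stance for a given error-budget state.
type ReleasePolicy string

const (
	PolicyNormal    ReleasePolicy = "release normally"
	PolicyReview    ReleasePolicy = "review each release"
	PolicyFreeze    ReleasePolicy = "freeze non-critical features"
	PolicyFixesOnly ReleasePolicy = "reliability fixes only"
	PolicyStop      ReleasePolicy = "stop feature releases"
)

// PolicyFor maps the remaining error budget (1.0 = untouched, 0 or less =
// exhausted) to a policy tier. Lower bounds are treated as inclusive.
func PolicyFor(remaining float64) ReleasePolicy {
	switch {
	case remaining <= 0:
		return PolicyStop
	case remaining < 0.10:
		return PolicyFixesOnly
	case remaining < 0.25:
		return PolicyFreeze
	case remaining <= 0.50:
		return PolicyReview
	default:
		return PolicyNormal
	}
}

// Remaining computes the remaining budget fraction from request counts
// observed so far in the SLO window.
func Remaining(target float64, total, failed uint64) float64 {
	allowed := (1 - target) * float64(total)
	if allowed <= 0 {
		if failed == 0 {
			return 1 // no traffic or no failures: budget untouched
		}
		return 0 // failures with no allowance: budget exhausted
	}
	return 1 - float64(failed)/allowed
}
```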

---

## 4. Observability Reinforcement Plan

### 4.1 The Three Pillars: Current State vs. Target

| Pillar | Current state | Target | Gap |
|------|------|------|------|
| **Metrics** | Defined but not exposed | Full Prometheus + Grafana | Register the route + add business metrics |
| **Logs** | Standard library log.Printf | Structured JSON + context fields | Adopt slog/zap + standardize fields |
| **Traces** | Missing entirely | OpenTelemetry tracing | Full instrumentation |

### 4.2 Metrics to Add

**Key metrics currently missing:**

```go
// Prometheus metrics that need to be added
var (
	// Error budget burn rate (derived directly from the SLOs)
	errorBudgetBurnRate = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "error_budget_burn_rate",
			Help: "Current error budget burn rate multiplier",
		},
		[]string{"slo"},
	)

	// Cache hit rate (referenced by the alert rules but not defined yet)
	cacheHitsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "cache_hits_total",
			Help: "Total cache hits",
		},
		[]string{"level", "operation"}, // level: l1/l2
	)

	cacheOperationsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "cache_operations_total",
			Help: "Total cache operations",
		},
		[]string{"level", "operation"},
	)

	// Database connection pool state (referenced by alerts but not defined)
	dbConnectionsActive = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "db_connections_active",
			Help: "Active database connections",
		},
	)

	dbConnectionsMax = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "db_connections_max",
			Help: "Maximum database connections",
		},
	)

	// Token refresh operations
	tokenRefreshTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "token_refresh_total",
			Help: "Total token refresh attempts",
		},
		[]string{"status"}, // success/failure/rate_limited
	)

	// Account lockout events
	accountLockTotal = prometheus.NewCounter(
		prometheus.CounterOpts{
			Name: "account_lock_total",
			Help: "Total account lockout events",
		},
	)

	// Anomalous login detections
	anomalyDetectedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "anomaly_detected_total",
			Help: "Total anomaly login detections",
		},
		[]string{"type"}, // geo_anomaly/device_anomaly/brute_force
	)
)
```

### 4.3 Structured Logging Plan

**Log field standard:**

```go
// Context fields every log line must carry
type LogContext struct {
	TraceID   string `json:"trace_id"` // OpenTelemetry trace
	SpanID    string `json:"span_id"`
	RequestID string `json:"request_id"` // X-Request-ID header
	UserID    string `json:"user_id,omitempty"`
	IP        string `json:"ip"`
	Method    string `json:"method"`
	Path      string `json:"path"`
	Duration  int64  `json:"duration_ms"`
	Status    int    `json:"status"`
	Error     string `json:"error,omitempty"`
}

// Fields specific to security events
type SecurityLogEvent struct {
	EventType string `json:"event_type"` // login_failed/brute_force/anomaly
	Severity  string `json:"severity"`   // low/medium/high/critical
	UserID    string `json:"user_id,omitempty"`
	IP        string `json:"ip"`
	DeviceID  string `json:"device_id,omitempty"`
	Details   string `json:"details"`
}
```

**Recommended: adopt `log/slog` (Go 1.21+):**

```go
// Replace log.Printf with slog
import (
	"log/slog"
	"os"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
)

// initLogger installs a structured JSON logger as the process-wide default
func initLogger() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level:     slog.LevelInfo,
		AddSource: false,
	}))
	slog.SetDefault(logger)
}

// StructuredLogger is a Gin middleware that injects request_id and
// writes one structured log line per request
func StructuredLogger() gin.HandlerFunc {
	return func(c *gin.Context) {
		// Reuse the caller's X-Request-ID, or generate one
		requestID := c.GetHeader("X-Request-ID")
		if requestID == "" {
			requestID = uuid.New().String()
		}
		c.Set("request_id", requestID)
		c.Header("X-Request-ID", requestID)

		start := time.Now()
		c.Next()

		slog.Info("http_request",
			"request_id", requestID,
			"method", c.Request.Method,
			"path", c.FullPath(),
			"status", c.Writer.Status(),
			"duration_ms", time.Since(start).Milliseconds(),
			"ip", c.ClientIP(),
			"user_id", c.GetString("user_id"), // empty unless an earlier middleware set it
		)
	}
}
```

### 4.4 Adding OpenTelemetry Distributed Tracing

```go
// Minimal tracing setup
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/trace"
)

// initTracing configures an OTLP/HTTP exporter and returns a shutdown function.
// serviceName is accepted but not yet attached to the exported spans.
func initTracing(endpoint string, serviceName string) (func(), error) {
	exporter, err := otlptracehttp.New(context.Background(),
		otlptracehttp.WithEndpoint(endpoint), // host:port of the collector
		otlptracehttp.WithInsecure(),         // plain HTTP, no TLS
	)
	if err != nil {
		return nil, err
	}

	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))), // 10% sampling
	)
	otel.SetTracerProvider(tp)

	return func() { tp.Shutdown(context.Background()) }, nil
}
```

---

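## 5. Alerting Improvements

### 5.1 Alert Severity Matrix

| Level | Definition | Response time | Channels | Examples |
|------|------|----------|----------|------|
| **P0-CRITICAL** | Service completely unavailable, all users affected | Within 5 minutes | Phone + Feishu + SMS | Health check failing, database down |
| **P1-CRITICAL** | Core functionality degraded, error budget burning fast | Within 15 minutes | Feishu + SMS | Login success rate < 95%, P99 > 2s |
| **P2-WARNING** | Performance degraded, error budget burning slowly | Within 1 hour | Feishu | Low cache hit rate, memory > 80% |
| **P3-INFO** | Abnormal trend that needs attention | During working hours | Email | Unusual online-user count, unusual API volume |
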
### 5.2 Burn-Rate Alerts Based on the Error Budget (Replacing the Current Threshold Alerts)

**Current problem:** The alerts in `alerts.yml` use fixed thresholds (for example "error rate > 5%"), which has two drawbacks:
1. **Many false positives**: a brief traffic blip fires the alert, causing alert fatigue
2. **Many false negatives**: a small but sustained excess exhausts the error budget without ever firing an alert

**Improvement: use burn-rate alerts**

```yaml
# Improved alerts.yml - based on SLO burn rate
groups:
  - name: ums-slo-burn-rate
    rules:
      # === SLO-1: API availability burn-rate alerts ===
      # Fast burn: 2% of the monthly error budget consumed in 1 hour → alert immediately
      - alert: APIAvailability_FastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (1 - 0.999) * 14.4
          AND
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 14.4
        for: 2m
        labels:
          severity: critical
          slo: api-availability
          page: "true"
        annotations:
          summary: "🔴 API availability SLO fast burn: respond immediately"
          description: |
            The error budget is being consumed at more than 14.4x the sustainable rate
            Current error rate: {{ $value | humanizePercentage }}
            If this continues for 1 hour, 2% of this month's error budget will be gone
            Remaining error budget: see the Grafana dashboard
            Runbook: https://docs/runbook/api-availability

      # Slow burn: 5% of the monthly error budget consumed in 6 hours → warning
      - alert: APIAvailability_SlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (1 - 0.999) * 6
          AND
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (1 - 0.999) * 6
        for: 15m
        labels:
          severity: warning
          slo: api-availability
          page: "false"
        annotations:
          summary: "🟡 API availability SLO slow burn: needs attention"
          description: |
            The error budget is being consumed at more than 6x the sustainable rate
            If this continues for 6 hours, 5% of this month's error budget will be gone

      # === SLO-2: latency burn-rate alert ===
      # The burn rate applies to the share of requests slower than 500ms (target 99%),
      # not to the latency value itself
      - alert: APILatency_FastBurn
        expr: |
          (
            1 - sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
                / sum(rate(http_request_duration_seconds_count[5m]))
          ) > (1 - 0.99) * 14.4
          AND
          (
            1 - sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
                / sum(rate(http_request_duration_seconds_count[1h]))
          ) > (1 - 0.99) * 14.4
        for: 2m
        labels:
          severity: critical
          slo: api-latency
          page: "true"
        annotations:
          summary: "🔴 API latency SLO fast burn"
          description: "Share of requests slower than the 500ms SLO threshold: {{ $value | humanizePercentage }}"

      # === Infrastructure alerts (kept threshold-based) ===
      - alert: ServiceDown
        expr: up{job="user-management"} == 0
        for: 1m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "🚨 Service instance down"
          description: "{{ $labels.instance }} has been offline for more than 1 minute"

      - alert: DatabaseDown
        expr: |
          sum(rate(http_requests_total{status="503"}[2m])) > 0
        for: 1m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "🚨 Database connection failure"

      - alert: HighLoginFailureRate_BruteForce
        expr: |
          sum(rate(user_logins_total{status="failed"}[5m]))
          /
          sum(rate(user_logins_total[5m])) > 0.5
        for: 3m
        labels:
          severity: critical
          category: security
        annotations:
          summary: "🔐 Suspected brute-force attack"
          description: "Login failure rate: {{ $value | humanizePercentage }}, above 50%"

      - alert: TokenRefreshFailureSpike
        expr: |
          sum(rate(token_refresh_total{status="failure"}[5m])) > 10
        for: 2m
        labels:
          severity: warning
          category: auth
        annotations:
          summary: "Spike in token refresh failures"

      - alert: AnomalyDetectionSpike
        expr: |
          sum(rate(anomaly_detected_total[5m])) > 5
        for: 2m
        labels:
          severity: warning
          category: security
        annotations:
          summary: "Spike in anomalous login detections, possible attack"
```

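
The reason every burn-rate rule checks two windows can be shown with a few lines of arithmetic. The helper names below are hypothetical and the numbers are illustrative, not measurements from UMS. The short window makes the alert stop firing soon after the problem ends, while the long window filters out brief blips.

```go
package main

// BurnRate expresses an observed error ratio as a multiple of the rate that
// would spend exactly the whole error budget over the SLO window.
func BurnRate(errorRatio, target float64) float64 {
	return errorRatio / (1 - target)
}

// ShouldAlert applies the multi-window rule: fire only when both the short
// and the long window burn faster than factor.
func ShouldAlert(shortRatio, longRatio, target, factor float64) bool {
	return BurnRate(shortRatio, target) > factor && BurnRate(longRatio, target) > factor
}
```

Under a 99.9% target, a sustained 1% error ratio never crosses a fixed 5% threshold, yet it burns at about 10x and trips the 6x slow-burn rule. Conversely, a 10% spike over 5 minutes that averages only 0.2% over the hour does not page, because the long window burns at about 2x.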
### 5.3 Multi-Channel Alert Receivers

```yaml
# alertmanager.yml, improved version (Feishu + WeCom + email)
# Note: Alertmanager does not expand ${VAR} placeholders itself;
# render them (for example with envsubst) before loading the file
global:
  resolve_timeout: 5m
  slack_api_url: '${ALERTMANAGER_SLACK_API_URL}'

route:
  group_by: ['alertname', 'slo', 'category']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # P0: page immediately (Feishu + SMS)
    - match:
        page: "true"
      receiver: 'oncall-page'
      group_wait: 10s
      repeat_interval: 1h
      continue: true

    # Security events: dedicated channel for the security team
    - match:
        category: security
      receiver: 'security-team'
      group_wait: 30s
      continue: true

    # Warning: alert group chat
    - match:
        severity: warning
      receiver: 'warning-channel'
      continue: false

receivers:
  - name: 'oncall-page'
    webhook_configs:
      - url: '${FEISHU_WEBHOOK_URL}'
        send_resolved: true
        http_config:
          bearer_token: '${FEISHU_TOKEN}'
    email_configs:
      - to: '${ONCALL_EMAIL}'
        from: '${ALERT_FROM}'
        smarthost: '${SMTP_HOST}'

  - name: 'security-team'
    webhook_configs:
      - url: '${SECURITY_FEISHU_WEBHOOK_URL}'
        send_resolved: true

  - name: 'warning-channel'
    webhook_configs:
      - url: '${WARNING_FEISHU_WEBHOOK_URL}'
        send_resolved: true

  - name: 'default'
    email_configs:
      - to: '${ALERTMANAGER_DEFAULT_TO}'
        from: '${ALERTMANAGER_FROM}'
        smarthost: '${ALERTMANAGER_SMARTHOST}'

inhibit_rules:
  # A Critical alert inhibits the Warning alert for the same service
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']
```

---

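## 6. Chaos Engineering Plan

### 6.1 Chaos Engineering Roadmap

```
Phase 1 (now): Game Day
  └── Manual fault injection + observing system behavior
  └── Goal: discover unknown failure modes

Phase 2 (in 1 month): scripted fault injection
  └── PowerShell/Shell scripts
  └── Goal: repeatable verification

Phase 3 (in 3 months): Continuous Chaos
  └── Scheduled, automated fault injection
  └── Goal: regression protection
```

### 6.2 Fault Injection Experiments

| Experiment ID | Fault type | Injection method | Expected behavior | Verification signal |
|---------|----------|----------|----------|----------|
| CE-001 | Database unavailable | Close the SQLite file handle | Returns 503, health check goes DOWN | `health_check_status == DOWN` |
| CE-002 | Redis unavailable | Stop the Redis service | Degrades to the L1 cache, business continues | No significant rise in error rate |
| CE-003 | High memory pressure | Inject a memory-leaking goroutine | GC keeps running normally, no OOM | `system_goroutines`, memory alert |
| CE-004 | Network latency | Add an artificial sleep | P99 latency alert fires | `APILatency_FastBurn` fires |
| CE-005 | Mass concurrent logins | Load-testing tool | Rate limiting works correctly | 429 response rate on the login endpoint |
| CE-006 | JWT secret rotation | Change the config and restart | Invalidated tokens are handled gracefully | 401 rate rises briefly, then recovers |
| CE-007 | Process crash recovery | SIGKILL the process | State is restored after restart | Time to restore service availability |
| CE-008 | Brute-force attack | High-frequency failed logins with ab/wrk | Account lockout + IP ban | `HighLoginFailureRate_BruteForce` |
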
### 6.3 Chaos Experiment Script (CE-005: Concurrent Login Load Test)

```powershell
# scripts/chaos/ce-005-concurrent-login.ps1
# Goal: verify that rate limiting works under high concurrency

param(
    [string]$BaseURL = "http://localhost:8080",
    [int]$Concurrency = 50,
    [int]$Duration = 30
)

Write-Host "=== CE-005: concurrent login load test ==="
Write-Host "Target: $BaseURL"
Write-Host "Concurrency: $Concurrency"

$results = @{
    total = 0
    success = 0
    rate_limited = 0
    other_error = 0
}

$jobs = 1..$Concurrency | ForEach-Object {
    Start-Job -ScriptBlock {
        param($BaseURL, $Duration)
        $end = (Get-Date).AddSeconds($Duration)
        $local_results = @{ total=0; success=0; rate_limited=0; error=0 }

        while ((Get-Date) -lt $end) {
            $body = @{
                account = "testuser_$((Get-Random -Max 1000))"
                password = "wrongpassword"
            } | ConvertTo-Json

            # Invoke-WebRequest throws on non-2xx responses, so for 401/429 the
            # status code has to be read from the exception instead of $resp
            try {
                $resp = Invoke-WebRequest -Uri "$BaseURL/api/v1/auth/login" `
                    -Method POST -Body $body -ContentType "application/json" `
                    -UseBasicParsing -ErrorAction Stop
                $code = [int]$resp.StatusCode
            } catch {
                # 0 when there was no HTTP response at all (connection failure)
                $code = [int]$_.Exception.Response.StatusCode
            }

            $local_results.total++
            switch ($code) {
                200 { $local_results.success++ }
                429 { $local_results.rate_limited++ }
                default { $local_results.error++ }
            }
        }
        return $local_results
    } -ArgumentList $BaseURL, $Duration
}

$jobs | Wait-Job | ForEach-Object {
    $r = Receive-Job $_
    $results.total += $r.total
    $results.success += $r.success
    $results.rate_limited += $r.rate_limited
    $results.other_error += $r.error
}

Write-Host "`n=== Load test results ==="
Write-Host "Total requests: $($results.total)"
Write-Host "Succeeded: $($results.success)"
Write-Host "Rate limited (429): $($results.rate_limited)"
Write-Host "Other errors: $($results.other_error)"
Write-Host "Rate-limited share: $([math]::Round($results.rate_limited / [math]::Max($results.total,1) * 100, 2))%"

# Verification: rate limiting should have kicked in
if ($results.rate_limited -gt 0) {
    Write-Host "`n✅ Experiment passed: rate limiting works" -ForegroundColor Green
} else {
    Write-Host "`n❌ Experiment failed: rate limiting never triggered, check the configuration" -ForegroundColor Red
    exit 1
}
```

---

## 7. Capacity Planning

### 7.1 Current Resource Baseline

| Resource | Current configuration | Estimated capacity | Bottleneck risk |
|------|----------|----------|----------|
| Concurrent users | Not measured | ~500 (estimate) | Database write lock (SQLite) |
| Memory | Not monitored | <500MB | High |
| L1 Cache | 10000 entries | ~100MB | Low |
| Rate limit | 1000 req/min | 16.7 req/s | Depends on workload |
| DB connection pool | Not configured (GORM default) | 10 concurrent | High |

### 7.2 Scaling Roadmap

```
Current state (SQLite, single machine)
    ↓ Migration trigger: concurrent users > 100 or write QPS > 50
PostgreSQL single primary
    ↓ Scaling trigger: read/write ratio > 4:1 or primary CPU > 60%
PostgreSQL primary/replica (read/write splitting)
    ↓ Scaling trigger: a single machine cannot sustain peak load
PostgreSQL connection pooling (PgBouncer) + read replicas
```

### 7.3 Recommended Database Connection Pool Settings

```yaml
# Recommended config.yaml settings (after migrating to PostgreSQL)
database:
  postgresql:
    max_open_conns: 50        # about 1/3 of PostgreSQL max_connections
    max_idle_conns: 10        # 20% of max_open_conns
    conn_max_lifetime: 1h     # cap connection age to guard against leaked or stale connections
    conn_max_idle_time: 5m    # reclaim idle connections
```

---

## 8. P0 Fix Implementation Plan

### 8.1 Fix Immediately (This Week)

#### Fix-1: Expose the Prometheus metrics endpoint

Register the `/metrics` endpoint on the router. The server starts from `cmd/server/main.go`; the change itself goes into `Setup()` in `router.go`:

```go
// Add to Setup() in router.go (before the v1 group)
import (
	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/user-management-system/internal/monitoring"
)

// New in Setup(): mount the middleware (fixes CRIT-02) and expose /metrics (fixes CRIT-01)
metrics := monitoring.GetGlobalMetrics()
r.engine.Use(monitoring.PrometheusMiddleware(metrics))
r.engine.GET("/metrics", gin.WrapH(
	promhttp.HandlerFor(metrics.GetRegistry(), promhttp.HandlerOpts{
		EnableOpenMetrics: true,
	}),
))
```

#### Fix-2: Add a Redis check to the health check

```go
// health.go: add a Redis check
func (h *HealthCheck) Check() *Status {
	status := &Status{
		Status: HealthStatusUP,
		Checks: make(map[string]CheckResult),
	}

	dbResult := h.checkDatabase()
	status.Checks["database"] = dbResult
	if dbResult.Status != HealthStatusUP {
		status.Status = HealthStatusDOWN
	}

	// New: Redis check (only when Redis is enabled)
	if h.redisClient != nil {
		redisResult := h.checkRedis()
		status.Checks["redis"] = redisResult
		// Redis being unavailable counts as degraded and does not change
		// the overall service status; it is only recorded as WARN
	}

	return status
}
```

#### Fix-3: Read the Webhook service `Enabled` flag from configuration

```go
// main.go fix
webhookService := service.NewWebhookService(db.DB, service.WebhookServiceConfig{
	Enabled: cfg.Webhook.Enabled, // read from configuration, no longer hardcoded
})
```

### 8.2 Complete This Month

1. Introduce structured logging (slog) to replace log.Printf
2. Add the missing Prometheus metrics (cache_hits_total and others)
3. Configure the Feishu Webhook alert channel
4. Convert alerts.yml to burn-rate alerts
5. Run chaos experiments CE-001 through CE-005 and record the results

### 8.3 Complete Next Quarter

1. Migrate SQLite → PostgreSQL (mandatory for production)
2. Add OpenTelemetry distributed tracing
3. Build an SLO dashboard (Grafana)
4. Put the error budget policy into effect and make it part of the release process

---

## 9. Runbooks

### Runbook-01: API availability drop

**Trigger:** the `APIAvailability_FastBurn` alert fires

**Response steps:**
1. Check the health endpoint: `curl http://<service-address>/health/ready`
2. Check recent deployments: `git log --oneline -10`
3. Check the database: `curl http://<service-address>/health | jq .checks.database`
4. Check the error log: `tail -100 logs/app.log | grep "ERROR"`
5. If the database is unhealthy → run the database recovery procedure
6. If there was a recent deployment → evaluate a rollback: `git revert HEAD`
7. Report status to users (if the impact lasts > 5 minutes)

**Recovery target:** MTTR < 30 minutes

---

### Runbook-02: Suspected brute-force attack

**Trigger:** the `HighLoginFailureRate_BruteForce` alert fires

**Response steps:**
1. Identify the attacking IPs: check the login log via `GET /api/v1/logs/login`
2. Confirm that IP banning has taken effect: look at `anomaly_detected_total{type="brute_force"}`
3. If IP banning has not taken effect: add the IPs to the blacklist manually (ip_security configuration)
4. Notify the security team
5. Assess whether the rate limit needs to be tightened temporarily

---

### Runbook-03: Database unavailable

**Trigger:** the `DatabaseDown` alert fires

**Response steps:**
1. Check immediately: `sqlite3 data/user_management.db ".tables"`
2. If the file is corrupted, restore from backup:
   ```powershell
   powershell -ExecutionPolicy Bypass -File scripts/ops/drill-sqlite-backup-restore.ps1
   ```
3. If the file is locked by a process: check for orphan processes holding it
4. Migration plan: the SQLite single point of failure is a known risk; raise the priority of the PostgreSQL migration immediately

---

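## 10. SRE Metrics (Quarterly Review)

| Metric | Target | Measurement |
|------|------|----------|
| **MTTR** (mean time to recovery) | < 30 minutes | Incident records |
| **MTBF** (mean time between failures) | > 720 hours | Operations logs |
| **Error budget consumption** | < 50% per month | Prometheus |
| **Alert noise ratio** | < 10% (share of alerts that were not real problems) | Manual review |
| **Chaos experiment pass rate** | > 80% | Experiment records |
| **Runbook coverage** | One runbook for every P0 alert | Documentation check |
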

---

## Appendix A: Recommended SRE Toolchain

| Tool | Purpose | Current status |
|------|------|----------|
| Prometheus | Metrics collection | ✅ Configured (route still to be registered) |
| Grafana | Metrics visualization | ✅ Dashboards exist |
| Alertmanager | Alert routing | ✅ Configured (needs real channels) |
| OpenTelemetry | Distributed tracing | ❌ Missing |
| Feishu/WeCom Webhook | Instant alerts | ❌ Missing |
| PagerDuty/oncall | On-call management | ❌ Missing |
| k6/wrk | Load testing | ❌ Missing |
| Log aggregation (Loki/ELK) | Log search | ❌ Missing |

---

## Appendix B: Quick Health Check Commands

```powershell
# Overall system health
Invoke-RestMethod -Uri "http://localhost:8080/health/ready"

# Check the metrics endpoint (after the fix)
Invoke-RestMethod -Uri "http://localhost:8080/metrics"

# Check auth API latency
Measure-Command { Invoke-RestMethod -Uri "http://localhost:8080/api/v1/auth/capabilities" }

# Check rate limiting
1..10 | ForEach-Object {
    $i = $_
    try {
        $resp = Invoke-WebRequest -Uri "http://localhost:8080/api/v1/auth/login" `
            -Method POST -Body '{"account":"x","password":"x"}' `
            -ContentType "application/json" -UseBasicParsing -ErrorAction Stop
        $code = [int]$resp.StatusCode
    } catch {
        # Non-2xx responses (401, 429, ...) throw; read the status from the exception
        $code = [int]$_.Exception.Response.StatusCode
    }
    Write-Host "Request ${i}: HTTP $code"
}
```

---

*This report is a full review by an SRE engineer; issue severity levels follow the Google SRE Book. All P0 issues must be fixed before launch, and P1 issues must be fixed within the next Sprint.*

*Next SLO review date: 2026-05-05*