# Health Check and Monitoring Guide

This document describes the system's health check endpoints, Prometheus metrics, and alerting rules.

---

## 1. Health check endpoints

The system exposes three health check endpoints, each suited to a different scenario:

| Endpoint | Path | Description | Use case |
|----------|------|-------------|----------|
| Liveness probe | `/health/live` | Confirms the process is alive | Kubernetes `livenessProbe` |
| Readiness probe | `/health/ready` | Confirms the service is ready to take traffic | Kubernetes `readinessProbe` |
| Health check | `/health` | Overall health status | Load balancers, health check scripts |

### 1.1 Response format

```json
{
  "status": "ok",
  "timestamp": "2026-05-10T13:00:00Z",
  "version": "1.0.0"
}
```

### 1.2 Response codes

| Status | HTTP response code | Description |
|--------|--------------------|-------------|
| ok | 200 | Service is healthy |
| degraded | 200 | Service is degraded (some dependencies, such as Redis, are unavailable) |
| unhealthy | 503 | Service is unhealthy (for example, the database is unreachable) |

---

## 2. Prometheus metrics

### 2.1 Exposure

Metrics endpoint: `GET /metrics`

Returns text in the Prometheus format.

### 2.2 Core metrics

#### HTTP metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `http_requests_total` | Counter | method, path, status | Total number of HTTP requests |
| `http_request_duration_seconds` | Histogram | method, path | Request latency distribution |

#### Authentication metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `login_attempts_total` | Counter | result, method | Login attempts (success/failure) |
| `active_sessions_total` | Gauge | (none) | Current number of active sessions |
| `refresh_tokens_total` | Counter | (none) | Number of token refreshes |

#### Database metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `db_query_duration_seconds` | Histogram | operation, table | Database query latency |
| `db_connections_open` | Gauge | type | Connections currently open |
| `db_connections_in_use` | Gauge | type | Connections currently in use |

#### Cache metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cache_hits_total` | Counter | cache_level | Number of cache hits |
| `cache_misses_total` | Counter | cache_level | Number of cache misses |
| `cache_operations_total` | Counter | operation | Total number of cache operations |

#### Rate limiting metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ratelimit_rejections_total` | Counter | endpoint, algorithm | Number of requests rejected by the rate limiter |

### 2.3 Viewing current metrics

```bash
curl http://localhost:8080/metrics
```

---

## 3. Alerting rules

### 3.1 Recommended alerting rules (Prometheus alerting rule format)

```yaml
groups:
  - name: user-management
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: up{job="user-management"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "User management service is down"

      # Error rate too high.
      # Both sides are wrapped in sum(): without it, series are matched on
      # identical labels (including status), so the ratio would always be 1.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx error rate above 5%"

      # Login failure rate too high (possible brute-force attack).
      # sum() is needed here for the same reason as in HighErrorRate.
      - alert: HighLoginFailureRate
        expr: |
          sum(rate(login_attempts_total{result="fail"}[5m]))
            / sum(rate(login_attempts_total[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Login failure rate above 80%, possible brute-force attack"

      # Response latency too high
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 response latency above 1 second"

      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_in_use / db_connections_open > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool usage above 90%"

      # Abnormal drop in active sessions
      - alert: ActiveSessionsDropped
        expr: |
          active_sessions_total < 10 and delta(active_sessions_total[10m]) < -5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active session count dropped sharply"

      # Frequent rate limit rejections
      - alert: RateLimitRejectionsHigh
        expr: |
          rate(ratelimit_rejections_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limit rejection frequency too high"
```

---

## 4. Grafana dashboards

The following dashboards are recommended:

### 4.1 Core dashboard metrics

**Overview dashboard**:
- Request rate (QPS)
- P50/P90/P99 latency
- Error rate
- Active sessions

**Auth dashboard**:
- Login attempts (success/failure)
- Token refresh count
- Active session trend
- TOTP enablement rate

**Database dashboard**:
- P99 query latency
- Connection pool usage
- Slow query count

**Cache dashboard**:
- Hit rate
- Miss rate
- L1/L2 cache comparison

---

## 5. Log keyword monitoring

Configure alerts on the following keywords in your log collection system (such as Loki or ELK):

| Keyword | Severity | Description |
|---------|----------|-------------|
| `auth: increment login attempts failed` | warning | Redis/L1 cache unavailable |
| `goroutine leak` | critical | Potential goroutine leak |
| `token blacklisted but refresh failed` | critical | Token blacklist write failed |
| `password reset code replay` | warning | Possible verification code replay |
| `temporary login token cleanup failed` | warning | Temporary token cleanup failed |
| `cache.Set failed` | warning | Cache write failed |
| `failed to send email` | warning | Email delivery failed |

---

## 6. Health check script example

```bash
#!/bin/bash
# health_check.sh: service health check script

HEALTH_URL="http://localhost:8080/health"
READY_URL="http://localhost:8080/health/ready"
METRICS_URL="http://localhost:8080/metrics"

# Request a URL and report whether it answered with HTTP 200.
check_endpoint() {
    local url=$1
    local name=$2
    local status
    status=$(curl -s -o /dev/null -w "%{http_code}" "$url")

    if [ "$status" -eq 200 ]; then
        echo "[OK] $name: $status"
        return 0
    else
        echo "[FAIL] $name: $status"
        return 1
    fi
}

# Run the checks
failed=0
check_endpoint "$HEALTH_URL" "Health" || failed=$((failed + 1))
check_endpoint "$READY_URL" "Ready" || failed=$((failed + 1))

# Check the Prometheus metrics endpoint (a failure here is only a warning)
status=$(curl -s -o /dev/null -w "%{http_code}" "$METRICS_URL")
if [ "$status" -eq 200 ]; then
    echo "[OK] Metrics: $status"
else
    echo "[WARN] Metrics: $status"
fi

# Check the database connection (via the application log)
if grep -q "database opened" logs/app.log 2>/dev/null; then
    echo "[OK] Database: connected"
else
    echo "[FAIL] Database: not connected"
    failed=$((failed + 1))
fi

# The exit code is the number of failed checks (0 means all passed)
exit $failed
```

---

## 7. Kubernetes deployment configuration example

```yaml
# Excerpt: metadata, selector, and the container image are omitted.
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: user-management
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```

---

*Last updated: 2026-05-10*
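
Because `degraded` and `ok` both return HTTP 200 (section 1.2), a check that looks only at the response code cannot tell them apart. The following is a minimal Python sketch of one way to separate the three states, assuming the response body carries the `status` field shown in section 1.1. The function name `classify_health` and the choice to treat an unparseable body as `unhealthy` are illustrative assumptions, not part of the service.

```python
import json


def classify_health(http_code: int, body: str) -> str:
    """Return "ok", "degraded", or "unhealthy" for one /health response.

    Hypothetical helper: the name and the fallback behavior are assumptions.
    """
    # Any non-200 response (503 in section 1.2) is treated as unhealthy.
    if http_code != 200:
        return "unhealthy"
    try:
        status = json.loads(body).get("status")
    except (json.JSONDecodeError, AttributeError):
        # Body is not a JSON object: assume the worst rather than report "ok".
        return "unhealthy"
    # HTTP 200 covers both "ok" and "degraded"; only the body separates them.
    return status if status in ("ok", "degraded") else "unhealthy"


if __name__ == "__main__":
    print(classify_health(200, '{"status": "ok"}'))
    print(classify_health(200, '{"status": "degraded"}'))
    print(classify_health(503, '{"status": "unhealthy"}'))
```

A monitoring script could use the result to raise a warning on `degraded` and a failure on `unhealthy`, instead of collapsing both 200 cases into a single pass.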