user-system/docs/guides/MONITORING.md

# Health Checks and Monitoring Guide

This document describes the system's health check endpoints, Prometheus metrics, and alerting rules.

---

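## 1. Health Check Endpoints

The system exposes three health check endpoints, each suited to a different scenario:

| Endpoint | Path | Description | Use case |
|----------|------|-------------|----------|
| Liveness probe | `/health/live` | Confirms the process is alive | Kubernetes `livenessProbe` |
| Readiness probe | `/health/ready` | Confirms the service is ready to accept traffic | Kubernetes `readinessProbe` |
| Health check | `/health` | Overall health status | Load balancers, health check scripts |

### 1.1 Response Format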

```json
{
  "status": "ok",
  "timestamp": "2026-05-10T13:00:00Z",
  "version": "1.0.0"
}
```

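### 1.2 Response Codes

| Status | HTTP status code | Description |
|--------|------------------|-------------|
| ok | 200 | Service is healthy |
| degraded | 200 | Service is degraded (some dependencies, such as Redis, are unavailable) |
| unhealthy | 503 | Service is unhealthy (for example, the database is unreachable) |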
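
Because `ok` and `degraded` both return HTTP 200, a caller that needs to tell them apart has to read the `status` field in the response body. The following is a minimal sketch of one way to do that in shell without extra tooling. The function names `health_status` and `health_severity` and the 0/1/2 severity scale are illustrative assumptions, not part of the service.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; not part of the service itself.

# Extract the "status" field from a health response body read on stdin.
# Assumes the flat JSON shape shown in section 1.1.
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Map a status string to a severity: 0 = ok, 1 = degraded, 2 = anything else.
health_severity() {
    case "$1" in
        ok)       echo 0 ;;
        degraded) echo 1 ;;
        *)        echo 2 ;;
    esac
}

# Example with an inline body instead of a live request.
body='{"status": "degraded", "timestamp": "2026-05-10T13:00:00Z", "version": "1.0.0"}'
status=$(printf '%s\n' "$body" | health_status)
echo "status=$status severity=$(health_severity "$status")"
```

Against a running instance, the body would come from `curl -s http://localhost:8080/health | health_status` instead of the inline sample.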

---

## 2. Prometheus Metrics

### 2.1 Exposition

Metrics endpoint: `GET /metrics`

The response is plain text in the Prometheus exposition format.

### 2.2 Core Metrics

#### HTTP Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `http_requests_total` | Counter | method, path, status | Total number of HTTP requests |
| `http_request_duration_seconds` | Histogram | method, path | Request latency distribution |

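#### Authentication Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `login_attempts_total` | Counter | result, method | Number of login attempts (success/failure) |
| `active_sessions_total` | Gauge | (none) | Number of currently active sessions |
| `refresh_tokens_total` | Counter | (none) | Number of token refreshes |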

#### Database Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `db_query_duration_seconds` | Histogram | operation, table | Database query latency |
| `db_connections_open` | Gauge | type | Number of currently open connections |
| `db_connections_in_use` | Gauge | type | Number of connections in use |

#### Cache Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cache_hits_total` | Counter | cache_level | Number of cache hits |
| `cache_misses_total` | Counter | cache_level | Number of cache misses |
| `cache_operations_total` | Counter | operation | Total number of cache operations |
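
The hit and miss counters combine into a hit ratio, `hits / (hits + misses)`, per `cache_level`. The sketch below computes that ratio from Prometheus text-format output using `awk`. The helper name `cache_hit_ratio` and the label values `l1` and `l2` in the sample are assumptions made for illustration, and the parser assumes each sample line ends with its value (no trailing timestamp).

```bash
#!/bin/bash
# Hypothetical helper: prints "<cache_level> <hit ratio>" for each level found
# in Prometheus text-format metrics read on stdin.
cache_hit_ratio() {
    awk '
        /^cache_(hits|misses)_total[{]/ {
            # Pull the cache_level label value out of the series name.
            if (match($0, /cache_level="[^"]*"/) == 0) next
            level = substr($0, RSTART + 13, RLENGTH - 14)
            if ($0 ~ /^cache_hits_total/) hits[level] += $NF
            else misses[level] += $NF
            seen[level] = 1
        }
        END {
            for (level in seen) {
                total = hits[level] + misses[level]
                if (total > 0) printf "%s %.2f\n", level, hits[level] / total
            }
        }
    ' | sort
}

# Made-up sample values, not real output from the service.
sample='# TYPE cache_hits_total counter
cache_hits_total{cache_level="l1"} 90
cache_hits_total{cache_level="l2"} 30
cache_misses_total{cache_level="l1"} 10
cache_misses_total{cache_level="l2"} 70'

printf '%s\n' "$sample" | cache_hit_ratio
```

This gives the ratio since the process started. A dashboard would more likely apply `rate()` to both counters so the ratio reflects a recent window.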

#### Rate Limiting Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ratelimit_rejections_total` | Counter | endpoint, algorithm | Number of requests rejected by rate limiting |

### 2.3 Viewing Current Metrics

```bash
curl http://localhost:8080/metrics
```

---

## 3. Alerting Rules

### 3.1 Recommended Alerting Rules (Prometheus / Alertmanager format)

```yaml
groups:
  - name: user-management
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: up{job="user-management"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "User management service is down"

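      # Error rate too high
      # sum() aggregates across all series; without it the division only
      # pairs series with identical labels, so each 5xx series is divided by itself.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx error rate above 5%"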

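      # Login failure rate too high (possible brute-force attack)
      # sum() is needed for the same reason as in HighErrorRate.
      - alert: HighLoginFailureRate
        expr: |
          sum(rate(login_attempts_total{result="fail"}[5m])) /
          sum(rate(login_attempts_total[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Login failure rate above 80%, possible brute-force attack"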

      # Response latency too high
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 response latency above 1 second"

      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_in_use / db_connections_open > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool usage above 90%"

      # Abnormal drop in active sessions
      - alert: ActiveSessionsDropped
        expr: |
          active_sessions_total < 10
          and
          delta(active_sessions_total[10m]) < -5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active session count dropped sharply"

      # Frequent rate-limit rejections
      - alert: RateLimitRejectionsHigh
        expr: |
          rate(ratelimit_rejections_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate-limit rejection frequency is too high"
```
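
To see what the `HighErrorRate` rule is meant to measure, it can help to work the arithmetic by hand. Over a single window, the error ratio is the increase in 5xx requests divided by the increase in all requests; the window length cancels out of the two rates. The sketch below does that calculation for made-up counter increases. The helper names `error_ratio` and `exceeds_threshold` are hypothetical, and the sketch ignores the `for: 5m` clause, which additionally requires the condition to hold continuously for five minutes.

```bash
#!/bin/bash
# Hypothetical helper: ratio of two counter increases over the same window.
# Usage: error_ratio <5xx_increase> <total_increase>
error_ratio() {
    awk -v errors="$1" -v total="$2" 'BEGIN {
        # No traffic: report 0 as a simplification. In PromQL, 0/0 is NaN,
        # which fails the comparison, so the rule would not fire either.
        if (total <= 0) { print "0.0000"; exit }
        printf "%.4f\n", errors / total
    }'
}

# Succeeds (exit status 0) when the ratio is strictly above the threshold.
# Usage: exceeds_threshold <ratio> <threshold>
exceeds_threshold() {
    awk -v r="$1" -v t="$2" 'BEGIN { exit !(r > t) }'
}

# Made-up example: 30 server errors out of 500 requests in the window.
ratio=$(error_ratio 30 500)
if exceeds_threshold "$ratio" 0.05; then
    echo "ratio=$ratio: above 5%, alert condition holds"
else
    echo "ratio=$ratio: below threshold"
fi
```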

---

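## 4. Grafana Dashboards

Importing the following dashboard configurations is recommended:

### 4.1 Core Dashboard Metrics

**Overview dashboard**:
- Request rate (QPS)
- P50/P90/P99 latency
- Error rate
- Active sessions

**Auth dashboard**:
- Login attempts (success/failure)
- Token refresh count
- Active session trend
- TOTP enablement rate

**Database dashboard**:
- P99 query latency
- Connection pool utilization
- Slow query count

**Cache dashboard**:
- Hit rate
- Miss rate
- L1/L2 cache comparison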

---

## 5. Log Keyword Monitoring

We recommend configuring alerts on the following keywords in your log aggregation system (such as Loki or ELK):

| Keyword | Severity | Description |
|---------|----------|-------------|
| `auth: increment login attempts failed` | warning | Redis/L1 cache unavailable |
| `goroutine leak` | critical | Potential goroutine leak |
| `token blacklisted but refresh failed` | critical | Token blacklist write failed |
| `password reset code replay` | warning | Possible verification code replay |
| `temporary login token cleanup failed` | warning | Temporary token cleanup failed |
| `cache.Set failed` | warning | Cache write failed |
| `failed to send email` | warning | Email delivery failed |
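
Before wiring these keywords into Loki or ELK, a quick local scan of a log file can confirm that they match as expected. The sketch below counts lines containing each keyword from the table. The helper name `scan_log` and the sample log lines are made up for illustration; the real log line format may differ.

```bash
#!/bin/bash
# Hypothetical helper: count lines containing each monitored keyword.
# Usage: scan_log <logfile>
scan_log() {
    local logfile=$1 severity keyword count
    [ -r "$logfile" ] || { echo "cannot read $logfile" >&2; return 2; }
    while IFS='|' read -r severity keyword; do
        # -F matches the keyword literally (the dot in cache.Set is not a
        # wildcard); -c prints the number of matching lines.
        count=$(grep -F -c -- "$keyword" "$logfile")
        if [ "$count" -gt 0 ]; then
            echo "[$severity] $keyword: $count"
        fi
    done <<'EOF'
warning|auth: increment login attempts failed
critical|goroutine leak
critical|token blacklisted but refresh failed
warning|password reset code replay
warning|temporary login token cleanup failed
warning|cache.Set failed
warning|failed to send email
EOF
}

# Made-up sample log, written to a temporary file for the demonstration.
log=$(mktemp)
printf '%s\n' \
    '2026-05-10T13:00:00Z WARN cache.Set failed: connection refused' \
    '2026-05-10T13:00:01Z INFO request served' \
    '2026-05-10T13:00:02Z WARN cache.Set failed: timeout' \
    '2026-05-10T13:00:03Z ERROR goroutine leak detected' > "$log"
scan_log "$log"
rm -f "$log"
```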

---

## 6. Health Check Script Example

```bash
#!/bin/bash
# health_check.sh: service health check script

HEALTH_URL="http://localhost:8080/health"
READY_URL="http://localhost:8080/health/ready"
METRICS_URL="http://localhost:8080/metrics"

# Print OK/FAIL for an endpoint; return 0 only when it answers HTTP 200.
# Note: a degraded service also answers 200 (see section 1.2).
check_endpoint() {
    local url=$1
    local name=$2
    local status
    # Assigned separately so `local` does not mask curl's exit status.
    status=$(curl -s -o /dev/null -w "%{http_code}" "$url")

    if [ "$status" -eq 200 ]; then
        echo "[OK] $name: $status"
        return 0
    else
        echo "[FAIL] $name: $status"
        return 1
    fi
}

# Run the checks and count failures
failed=0

check_endpoint "$HEALTH_URL" "Health" || failed=$((failed + 1))
check_endpoint "$READY_URL" "Ready" || failed=$((failed + 1))

# Check the Prometheus metrics endpoint (warning only, not counted as a failure)
status=$(curl -s -o /dev/null -w "%{http_code}" "$METRICS_URL")
if [ "$status" -eq 200 ]; then
    echo "[OK] Metrics: $status"
else
    echo "[WARN] Metrics: $status"
fi

# Check the database connection (via the log)
if grep -q "database opened" logs/app.log 2>/dev/null; then
    echo "[OK] Database: connected"
else
    echo "[FAIL] Database: not connected"
    failed=$((failed + 1))
fi

# The exit status is the number of failed checks (0 means all passed)
exit "$failed"
```

---

## 7. Kubernetes Deployment Configuration Example

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: user-management
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3

          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090

          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```
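
The probe settings above determine how long a failing container can go unnoticed. As a rough upper bound, assuming the failure starts just after a successful probe and ignoring kubelet scheduling jitter, detection takes about `periodSeconds * failureThreshold + timeoutSeconds`. The sketch below works that out for the values in the manifest; the helper name `probe_failure_window` is hypothetical and the formula is an approximation, not a guarantee from Kubernetes.

```bash
#!/bin/bash
# Hypothetical helper: approximate worst-case seconds between a container
# starting to fail its probe and Kubernetes acting on it.
# Usage: probe_failure_window <periodSeconds> <failureThreshold> <timeoutSeconds>
probe_failure_window() {
    local period=$1 threshold=$2 timeout=$3
    echo $(( period * threshold + timeout ))
}

# Values taken from the livenessProbe and readinessProbe in the manifest.
echo "liveness:  ~$(probe_failure_window 15 3 5)s before the container is restarted"
echo "readiness: ~$(probe_failure_window 10 3 3)s before the pod stops receiving traffic"
```

With these settings, a hung process could keep running for close to a minute before being restarted. Lowering `periodSeconds` shortens that window at the cost of more probe traffic.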

---

*Last updated: 2026-05-10*