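Health Check and Monitoring Guide

This document describes the system's health check endpoints, Prometheus metrics, and alerting rules.

1. Health Check Endpoints

The system exposes three health check endpoints, each suited to a different scenario: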

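| Endpoint | Path | Description | Use case |
| --- | --- | --- | --- |
| Liveness probe | `/health/live` | Confirms the process is alive | Kubernetes `livenessProbe` |
| Readiness probe | `/health/ready` | Confirms the service is ready | Kubernetes `readinessProbe` |
| Health check | `/health` | Overall health status | Load balancers, health check scripts |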

1.1 Response Format

{
  "status": "ok",
  "timestamp": "2026-05-10T13:00:00Z",
  "version": "1.0.0"
}

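1.2 Response Codes

| Status | HTTP status code | Description |
| --- | --- | --- |
| ok | 200 | Service is healthy |
| degraded | 200 | Service is degraded (some dependencies are unavailable, e.g. Redis) |
| unhealthy | 503 | Service is unhealthy (e.g. the database is unreachable) |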
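
The mapping from dependency state to status and HTTP code can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the function name `overall_status` and the assumption that the database is the only critical dependency and Redis the only optional one are taken from the examples in the table, not from the service code.

```python
from typing import Tuple


def overall_status(database_ok: bool, redis_ok: bool) -> Tuple[str, int]:
    """Return (status, HTTP code) for the given dependency states."""
    if not database_ok:
        # Assumption: the database is critical, so the service reports 503.
        return "unhealthy", 503
    if not redis_ok:
        # Assumption: Redis is optional, so the service still answers 200.
        return "degraded", 200
    return "ok", 200


if __name__ == "__main__":
    for db, redis in [(True, True), (True, False), (False, True)]:
        print(db, redis, overall_status(db, redis))
```

Because `degraded` also returns 200, a caller that only inspects the status code cannot tell it apart from `ok`; it has to read the `status` field of the response body.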

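2. Prometheus Metrics

2.1 Exposure

Metrics endpoint: `GET /metrics`

The response is in the Prometheus text exposition format.

2.2 Core Metrics

HTTP metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `http_requests_total` | Counter | method, path, status | Total number of HTTP requests |
| `http_request_duration_seconds` | Histogram | method, path | Request latency distribution |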
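
Since latency is exported as a histogram, percentiles such as P99 are estimated from cumulative bucket counts rather than measured exactly. The sketch below mirrors the linear interpolation that Prometheus's `histogram_quantile` performs, in simplified form. The bucket boundaries and counts are hypothetical; the service's real bucket layout is not stated in this document.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    with at least one finite bucket and a final +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical buckets: upper bounds in seconds with cumulative counts.
BUCKETS = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]

if __name__ == "__main__":
    print(estimate_quantile(0.5, BUCKETS))
    print(estimate_quantile(0.95, BUCKETS))
    print(estimate_quantile(0.99, BUCKETS))
```

The estimate never exceeds the highest finite bucket bound, so a threshold such as "P99 above 1 second" is only meaningful if the histogram has buckets above 1 second.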

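Authentication metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `login_attempts_total` | Counter | result, method | Login attempts (success/failure) |
| `active_sessions_total` | Gauge | (none) | Current number of active sessions |
| `refresh_tokens_total` | Counter | (none) | Number of token refreshes |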

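Database metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `db_query_duration_seconds` | Histogram | operation, table | Database query latency |
| `db_connections_open` | Gauge | type | Number of currently open connections |
| `db_connections_in_use` | Gauge | type | Number of connections in use |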

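Cache metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `cache_hits_total` | Counter | cache_level | Number of cache hits |
| `cache_misses_total` | Counter | cache_level | Number of cache misses |
| `cache_operations_total` | Counter | operation | Total number of cache operations |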

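Rate-limiting metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `ratelimit_rejections_total` | Counter | endpoint, algorithm | Number of requests rejected by the rate limiter |

2.3 Viewing Current Metrics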

curl http://localhost:8080/metrics
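
The text returned by `/metrics` is line-oriented, so it is easy to inspect programmatically. The sketch below is a simplified parser, not a full implementation of the exposition format: it ignores comment lines and does not handle escaped quotes inside label values. The sample payload, including the `/login` path and the HELP text, is made up for illustration.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# metric_name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload in the shape the endpoint would return.
SAMPLE_TEXT = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 42
http_requests_total{method="POST",path="/login",status="500"} 3
active_sessions_total 17
"""

if __name__ == "__main__":
    for sample in parse_metrics(SAMPLE_TEXT):
        print(sample)
```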

3. Alerting Rules

3.1 Recommended Alerting Rules (Prometheus rule file format; alerts are routed through Alertmanager)

groups:
  - name: user-management
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: up{job="user-management"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "User management service is down"

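      # Error rate too high.
      # Both sides are aggregated with sum(): without it, PromQL divides only
      # series with identical label sets, so each 5xx series would be divided
      # by itself and the ratio would always be 1.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx error rate is above 5%"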

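      # Login failure rate too high (possible brute-force attack).
      # Aggregated with sum() so that failed attempts are compared with all
      # attempts rather than with the same labelled series.
      - alert: HighLoginFailureRate
        expr: |
          sum(rate(login_attempts_total{result="fail"}[5m])) /
          sum(rate(login_attempts_total[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Login failure rate is above 80%; possible brute-force attack"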

      # Response latency too high (evaluated per method and path)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 response latency is above 1 second"

      # Database connection pool close to exhaustion
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_in_use / db_connections_open > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool usage is above 90%"

      # Abnormal drop in active sessions
      - alert: ActiveSessionsDropped
        expr: |
          active_sessions_total < 10
          and
          delta(active_sessions_total[10m]) < -5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active session count dropped sharply"

      # Frequent rate-limit rejections
      - alert: RateLimitRejectionsHigh
        expr: |
          rate(ratelimit_rejections_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate-limit rejection frequency is too high"
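
A ratio alert such as `HighErrorRate` is meant to compare the summed 5xx request rate with the summed rate of all requests. The sketch below works through that arithmetic on hypothetical per-series rates (requests per second, keyed by method, path, and status); the numbers and the `/login` path are invented for illustration.

```python
from typing import Dict, Tuple

# (method, path, status) -> requests per second over the window
SeriesRates = Dict[Tuple[str, str, str], float]


def error_ratio(rates: SeriesRates) -> float:
    """Share of all requests that ended in a 5xx status."""
    total = sum(rates.values())
    if total == 0:
        return 0.0  # no traffic, so no error ratio to alert on
    errors = sum(
        rate
        for (_method, _path, status), rate in rates.items()
        if len(status) == 3 and status.startswith("5")
    )
    return errors / total


# Hypothetical rates: 10 requests/s overall, 0.5 of them failing with 500.
RATES: SeriesRates = {
    ("GET", "/health", "200"): 9.0,
    ("POST", "/login", "200"): 0.5,
    ("POST", "/login", "500"): 0.5,
}

if __name__ == "__main__":
    ratio = error_ratio(RATES)
    print(f"error ratio: {ratio:.3f}, alert fires: {ratio > 0.05}")
```

With these numbers the ratio is exactly at the 5% threshold, so a rule using a strict `> 0.05` comparison would not fire.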

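4. Grafana Dashboards

The following dashboards are recommended:

4.1 Core Dashboard Panels

Overview dashboard:

- Request rate (QPS)
- P50/P90/P99 latency
- Error rate
- Active sessions

Auth dashboard:

- Login attempts (success/failure)
- Token refreshes
- Active session trend
- TOTP adoption rate

Database dashboard:

- P99 query latency
- Connection pool usage
- Slow query count

Cache dashboard:

- Hit rate
- Miss rate
- L1/L2 cache comparison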
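
The cache hit rate panel is derived from the two cache counters: hits divided by hits plus misses, computed over a time window because counters only ever grow. A minimal sketch of that calculation from two counter snapshots follows. The snapshot values and the `cache_level` values `L1` and `L2` are assumptions for illustration.

```python
from typing import Dict

# cache_level -> {"hits": cumulative count, "misses": cumulative count}
Snapshot = Dict[str, Dict[str, float]]


def windowed_hit_ratio(earlier: Snapshot, later: Snapshot) -> Dict[str, float]:
    """Hit ratio per cache level between two counter snapshots."""
    ratios: Dict[str, float] = {}
    for level, now in later.items():
        before = earlier.get(level, {"hits": 0.0, "misses": 0.0})
        hits = now["hits"] - before["hits"]
        misses = now["misses"] - before["misses"]
        lookups = hits + misses
        # No lookups in the window means there is nothing to report.
        ratios[level] = hits / lookups if lookups > 0 else 0.0
    return ratios


# Hypothetical counter values scraped at the start and end of a window.
EARLIER: Snapshot = {"L1": {"hits": 1000, "misses": 200}, "L2": {"hits": 150, "misses": 50}}
LATER: Snapshot = {"L1": {"hits": 1900, "misses": 300}, "L2": {"hits": 210, "misses": 90}}

if __name__ == "__main__":
    for level, ratio in windowed_hit_ratio(EARLIER, LATER).items():
        print(f"{level}: {ratio:.0%}")
```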

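5. Log Keyword Monitoring

Configure alerts on the following keywords in your log aggregation system (e.g. Loki or ELK):

| Keyword | Severity | Description |
| --- | --- | --- |
| `auth: increment login attempts failed` | warning | Redis/L1 cache unavailable |
| `goroutine leak` | critical | Potential goroutine leak |
| `token blacklisted but refresh failed` | critical | Token blacklist write failed |
| `password reset code replay` | warning | Possible verification code replay |
| `temporary login token cleanup failed` | warning | Temporary token cleanup failed |
| `cache.Set failed` | warning | Cache write failed |
| `failed to send email` | warning | Email delivery failed |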
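
Before wiring these keywords into a log platform, the matching can be tried locally. The sketch below does plain substring matching over log lines and reports the severity from the table. The sample log lines and their timestamp/level layout are hypothetical, since the service's log format is not described here.

```python
from typing import Dict, Iterable, List, Tuple

# Keyword -> severity, copied from the table above.
KEYWORDS: Dict[str, str] = {
    "auth: increment login attempts failed": "warning",
    "goroutine leak": "critical",
    "token blacklisted but refresh failed": "critical",
    "password reset code replay": "warning",
    "temporary login token cleanup failed": "warning",
    "cache.Set failed": "warning",
    "failed to send email": "warning",
}


def scan(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, severity, keyword) for every keyword match."""
    alerts: List[Tuple[int, str, str]] = []
    for number, line in enumerate(lines, start=1):
        for keyword, severity in KEYWORDS.items():
            if keyword in line:
                alerts.append((number, severity, keyword))
    return alerts


# Hypothetical log lines for illustration only.
LOG = [
    "2026-05-10T13:00:01Z WARN cache.Set failed: connection refused",
    "2026-05-10T13:00:02Z INFO request served",
    "2026-05-10T13:00:03Z ERROR token blacklisted but refresh failed",
]

if __name__ == "__main__":
    for number, severity, keyword in scan(LOG):
        print(f"line {number}: [{severity}] {keyword}")
```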

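6. Health Check Script Example

#!/bin/bash
# health_check.sh: service health check script
# Exits with the number of failed checks (0 means everything passed).

HEALTH_URL="http://localhost:8080/health"
READY_URL="http://localhost:8080/health/ready"
METRICS_URL="http://localhost:8080/metrics"

# Print the HTTP status code for a URL.
# curl prints "000" when the connection fails or the 5-second limit is hit.
http_status() {
    curl -s -o /dev/null --max-time 5 -w "%{http_code}" "$1"
}

# Report OK when the endpoint answers 200; return non-zero otherwise.
check_endpoint() {
    local url=$1
    local name=$2
    local status
    status=$(http_status "$url")

    if [ "$status" -eq 200 ]; then
        echo "[OK] $name: $status"
        return 0
    else
        echo "[FAIL] $name: $status"
        return 1
    fi
}

# Run the checks
failed=0

check_endpoint "$HEALTH_URL" "Health" || failed=$((failed + 1))
check_endpoint "$READY_URL" "Ready" || failed=$((failed + 1))

# Check the Prometheus metrics endpoint (a failure here is only a warning)
status=$(http_status "$METRICS_URL")
if [ "$status" -eq 200 ]; then
    echo "[OK] Metrics: $status"
else
    echo "[WARN] Metrics: $status"
fi

# Check the database connection (via the log).
# Note: this only shows the connection was opened at some point,
# not that it is still usable.
if grep -q "database opened" logs/app.log 2>/dev/null; then
    echo "[OK] Database: connected"
else
    echo "[FAIL] Database: not connected"
    failed=$((failed + 1))
fi

exit "$failed"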

7. Kubernetes Deployment Example

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: user-management
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3

          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

          ports:
            - name: http
              containerPort: 8080
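            # Note: sections 2.3 and 6 query /metrics on port 8080. Keep this
            # port only if the service runs a separate metrics listener on 9090.
            - name: metrics
              containerPort: 9090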

          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
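
The probe settings above determine how quickly Kubernetes reacts to a failing container. As a rough approximation, the delay from the first failure to the kubelet acting on it is bounded by `failureThreshold × periodSeconds + timeoutSeconds`. The helper below is a hypothetical calculation that ignores kubelet scheduling jitter and the time a restart itself takes.

```python
def detection_upper_bound(period_seconds: int, timeout_seconds: int, failure_threshold: int) -> int:
    """Rough upper bound, in seconds, from the first failure to the kubelet acting on it."""
    return failure_threshold * period_seconds + timeout_seconds


# Values taken from the livenessProbe and readinessProbe in the manifest above.
liveness = detection_upper_bound(period_seconds=15, timeout_seconds=5, failure_threshold=3)
readiness = detection_upper_bound(period_seconds=10, timeout_seconds=3, failure_threshold=3)

if __name__ == "__main__":
    print(f"liveness: container restarted after roughly {liveness}s at most")
    print(f"readiness: pod removed from endpoints after roughly {readiness}s at most")
```

Lowering `periodSeconds` or `failureThreshold` shortens detection time, but makes the probes more sensitive to brief slowdowns.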

Last updated: 2026-05-10
