# Health Check and Monitoring Guide

This document describes the system's health check endpoints, Prometheus metrics, and alerting rules.

---
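## 1. Health Check Endpoints

The system exposes three health check endpoints, each suited to a different scenario:

| Endpoint | Path | Description | Typical use |
|----------|------|-------------|-------------|
| Liveness probe | `/health/live` | Confirms the process is alive | Kubernetes `livenessProbe` |
| Readiness probe | `/health/ready` | Confirms the service is ready to serve traffic | Kubernetes `readinessProbe` |
| Health check | `/health` | Overall health status | Load balancers, health check scripts |

### 1.1 Response Format
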
```json
{
  "status": "ok",
  "timestamp": "2026-05-10T13:00:00Z",
  "version": "1.0.0"
}
```
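### 1.2 Response Codes

| Status | HTTP status code | Description |
|--------|------------------|-------------|
| ok | 200 | Service is healthy |
| degraded | 200 | Service is degraded (some dependencies, such as Redis, are unavailable) |
| unhealthy | 503 | Service is unhealthy (for example, the database is unreachable) |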
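
As a minimal sketch of how a client might interpret the two tables above, the helper below combines the HTTP status code with the `status` field of the JSON body. The names `HealthVerdict` and `classify_health` are hypothetical and not part of the service; the sketch assumes the body follows the format in section 1.1.

```python
import json
from typing import NamedTuple


class HealthVerdict(NamedTuple):
    serving: bool   # should traffic still be routed to this instance?
    degraded: bool  # is a dependency (for example Redis) unavailable?


def classify_health(http_code: int, body: str) -> HealthVerdict:
    """Map an HTTP status code and JSON body to a verdict (hypothetical helper)."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        parsed = None
    status = parsed.get("status", "") if isinstance(parsed, dict) else ""
    # Per the table: ok and degraded both answer 200, unhealthy answers 503.
    serving = http_code == 200 and status in ("ok", "degraded")
    return HealthVerdict(serving=serving, degraded=(status == "degraded"))
```

Because `degraded` still returns 200, a load balancer that looks only at the status code keeps routing traffic; reading the body is what lets a script warn about the degraded state.
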
---
## 2. Prometheus Metrics

### 2.1 Exposure

Metrics endpoint: `GET /metrics`. It returns text in the Prometheus exposition format.

### 2.2 Core Metrics

#### HTTP Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `http_requests_total` | Counter | method, path, status | Total number of HTTP requests |
| `http_request_duration_seconds` | Histogram | method, path | Request latency distribution |
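
To illustrate how a latency percentile is estimated from a histogram such as `http_request_duration_seconds`, here is a simplified sketch of the linear interpolation that Prometheus's `histogram_quantile` applies to cumulative buckets. The bucket boundaries and counts in the usage note are made up for illustration; the real bucket layout of this service is not documented here.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: buckets must be sorted by upper bound and end with
    a +Inf bucket, and q must satisfy 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

For example, with hypothetical buckets `[(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]`, the median estimate is 0.1 seconds. The estimate can never be more precise than the bucket boundaries allow.
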
#### Authentication Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `login_attempts_total` | Counter | result, method | Login attempts (success/failure) |
| `active_sessions_total` | Gauge | (none) | Current number of active sessions |
| `refresh_tokens_total` | Counter | (none) | Number of token refreshes |
#### Database Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `db_query_duration_seconds` | Histogram | operation, table | Database query latency |
| `db_connections_open` | Gauge | type | Number of currently open connections |
| `db_connections_in_use` | Gauge | type | Number of connections in use |
#### Cache Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cache_hits_total` | Counter | cache_level | Number of cache hits |
| `cache_misses_total` | Counter | cache_level | Number of cache misses |
| `cache_operations_total` | Counter | operation | Total number of cache operations |
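
A cache hit ratio can be derived from `cache_hits_total` and `cache_misses_total` by comparing two scrapes. The sketch below shows the arithmetic; the function names are hypothetical, and treating a decreasing counter as a process restart is a simplifying assumption.

```python
from typing import Optional, Tuple


def counter_increase(before: float, after: float) -> float:
    """Increase of a counter between two scrapes; a drop is treated as a restart."""
    return after - before if after >= before else after


def cache_hit_ratio(hits: Tuple[float, float],
                    misses: Tuple[float, float]) -> Optional[float]:
    """Hit ratio over the interval, given (before, after) values of each counter."""
    hit_inc = counter_increase(*hits)
    miss_inc = counter_increase(*misses)
    lookups = hit_inc + miss_inc
    if lookups == 0:
        return None  # no lookups in the interval, so the ratio is undefined
    return hit_inc / lookups
```

Returning `None` for an idle interval avoids reporting a misleading 0% or 100% when the cache was simply not used.
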
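#### Rate Limiting Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ratelimit_rejections_total` | Counter | endpoint, algorithm | Number of requests rejected by the rate limiter |

### 2.3 Viewing Current Metrics
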
```bash
curl http://localhost:8080/metrics
```
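
The output of the command above is plain text, one sample per line. As a rough sketch of how a script could read it, the parser below handles the common `name{labels} value` form. It is a simplification: it skips `# HELP` and `# TYPE` lines, ignores optional timestamps, and does not unescape label values. For production use, a maintained Prometheus client library is a better choice.

```python
import re
from typing import Dict, List, Tuple

# Metric name, optional {label="value",...} block, then the sample value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text exposition output into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = _SAMPLE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```
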
---
## 3. Alerting Rules

### 3.1 Recommended Alerting Rules (Prometheus / Alertmanager Format)

```yaml
groups:
  - name: user-management
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: up{job="user-management"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "User management service is down"
      # Error rate too high
      # sum() aggregates both sides; without it each 5xx series is divided by itself
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx error rate is above 5%"
      # Login failure rate too high (possible brute-force attack)
      - alert: HighLoginFailureRate
        expr: |
          sum(rate(login_attempts_total{result="fail"}[5m])) /
          sum(rate(login_attempts_total[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Login failure rate is above 80%, possible brute-force attack"
      # Response latency too high
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 response latency is above 1 second"
      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_in_use / db_connections_open > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool utilization is above 90%"
      # Abnormal drop in active sessions
      - alert: ActiveSessionsDropped
        expr: |
          active_sessions_total < 10
          and
          delta(active_sessions_total[10m]) < -5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active session count dropped sharply"
      # Frequent rate limit rejections
      - alert: RateLimitRejectionsHigh
        expr: |
          rate(ratelimit_rejections_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limit rejection frequency is too high"
```
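
Each rule above carries a `for:` clause, which means the expression must stay true for that long before the alert fires. The sketch below models that behaviour in simplified form: given a series of rule evaluations, it reports whether the alert is inactive, pending, or firing after each one. The function name and the sample timestamps are illustrative assumptions.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(evaluations: Iterable[Tuple[float, bool]],
                 for_seconds: float) -> List[str]:
    """State after each (timestamp_seconds, condition_is_true) evaluation."""
    states: List[str] = []
    active_since: Optional[float] = None
    for timestamp, condition in evaluations:
        if not condition:
            # A single false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_long_enough = timestamp - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states
```

With `for: 5m` (300 seconds), a condition that flaps back to false even once starts the wait over, which is why short error spikes do not page anyone.
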
---
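## 4. Grafana Dashboards

The following dashboards are recommended:

### 4.1 Core Dashboard Panels

**Overview dashboard**

- Request rate (QPS)
- P50/P90/P99 latency
- Error rate
- Active sessions

**Auth dashboard**

- Login attempts (success/failure)
- Token refresh count
- Active session trend
- TOTP adoption rate

**Database dashboard**

- P99 query latency
- Connection pool utilization
- Slow query count

**Cache dashboard**

- Hit rate
- Miss rate
- L1 vs. L2 cache comparison
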
---
## 5. Log Keyword Monitoring

Configure keyword alerts for the following messages in your log aggregation system (for example, Loki or ELK):

| Keyword | Severity | Description |
|---------|----------|-------------|
| `auth: increment login attempts failed` | warning | Redis/L1 cache unavailable |
| `goroutine leak` | critical | Potential goroutine leak |
| `token blacklisted but refresh failed` | critical | Token blacklist write failed |
| `password reset code replay` | warning | Possible replay of a verification code |
| `temporary login token cleanup failed` | warning | Temporary token cleanup failed |
| `cache.Set failed` | warning | Cache write failed |
| `failed to send email` | warning | Email delivery failed |
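
Where no log aggregation system is available, the table above can be applied to raw log lines directly. The following is a minimal sketch under that assumption: the keywords and severities are copied from the table, while the function names and plain substring matching are illustrative choices.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Keyword -> severity, copied from the table above.
KEYWORD_SEVERITY: Dict[str, str] = {
    "auth: increment login attempts failed": "warning",
    "goroutine leak": "critical",
    "token blacklisted but refresh failed": "critical",
    "password reset code replay": "warning",
    "temporary login token cleanup failed": "warning",
    "cache.Set failed": "warning",
    "failed to send email": "warning",
}


def count_keyword_hits(lines: Iterable[str]) -> Dict[str, int]:
    """Count how many log lines contain each monitored keyword."""
    hits: Counter = Counter()
    for line in lines:
        for keyword in KEYWORD_SEVERITY:
            if keyword in line:
                hits[keyword] += 1
    return dict(hits)


def highest_severity(hits: Dict[str, int]) -> Optional[str]:
    """Return "critical", "warning", or None for a dict of keyword hits."""
    severities = {KEYWORD_SEVERITY[keyword] for keyword in hits}
    if "critical" in severities:
        return "critical"
    if "warning" in severities:
        return "warning"
    return None
```
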
---
## 6. Example Health Check Script

```bash
#!/bin/bash
# health_check.sh: service health check script

HEALTH_URL="http://localhost:8080/health"
READY_URL="http://localhost:8080/health/ready"
METRICS_URL="http://localhost:8080/metrics"

# Print OK/FAIL for one endpoint; return non-zero unless it answers HTTP 200.
check_endpoint() {
    local url=$1
    local name=$2
    local status
    # --max-time keeps the script from hanging on an unresponsive service
    status=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "$url")
    if [ "$status" = "200" ]; then
        echo "[OK] $name: $status"
        return 0
    else
        echo "[FAIL] $name: $status"
        return 1
    fi
}

# Run the checks
failed=0
check_endpoint "$HEALTH_URL" "Health" || failed=$((failed + 1))
check_endpoint "$READY_URL" "Ready" || failed=$((failed + 1))

# Check the Prometheus metrics endpoint (a failure here is only a warning)
status=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "$METRICS_URL")
if [ "$status" = "200" ]; then
    echo "[OK] Metrics: $status"
else
    echo "[WARN] Metrics: $status"
fi

# Check the database connection (by looking for the startup message in the log)
if grep -q "database opened" logs/app.log 2>/dev/null; then
    echo "[OK] Database: connected"
else
    echo "[FAIL] Database: not connected"
    failed=$((failed + 1))
fi

# The exit code is the number of failed checks (0 means all passed)
exit "$failed"
```
---
## 7. Example Kubernetes Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: user-management
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```
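
The probe settings above determine how quickly Kubernetes reacts to a failing container. As a back-of-the-envelope sketch (an approximation, not the kubelet's exact scheduling), the time from the first failure to the kubelet acting lies roughly between `(failureThreshold - 1) * periodSeconds` and `failureThreshold * periodSeconds + timeoutSeconds`. The `Probe` type and `detection_window` function are hypothetical helpers for this arithmetic.

```python
from typing import NamedTuple, Tuple


class Probe(NamedTuple):
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int


def detection_window(probe: Probe) -> Tuple[int, int]:
    """Rough (best, worst) seconds from the first failure to kubelet action."""
    # Best case: the failure starts just before a probe and probes fail instantly.
    best = (probe.failure_threshold - 1) * probe.period_seconds
    # Worst case: the failure starts just after a successful probe,
    # and the final failing probe runs into its timeout.
    worst = probe.failure_threshold * probe.period_seconds + probe.timeout_seconds
    return best, worst


# Values taken from the Deployment fragment above.
liveness = Probe(period_seconds=15, timeout_seconds=5, failure_threshold=3)
readiness = Probe(period_seconds=10, timeout_seconds=3, failure_threshold=3)
```

Under these assumptions, a hung process is restarted roughly 30 to 50 seconds after it stops answering `/health/live`, and an unready pod is taken out of service roughly 20 to 33 seconds after `/health/ready` starts failing.
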
---
*Last updated: 2026-05-10*