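# DO-P1-2: Operations and Rollback Runbook
> Status: ✅ Delivered
> Owner: DevOps (filled in on their behalf by 宰相)
> Baseline: P0 complete; Gate B pre-production verification
> Date: 2026-05-04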
---
## 0. Recommended Gate B entry point
Pre-production Gate B verification should no longer be pieced together from ad hoc manual commands. Prefer:
- [scripts/verify_preprod_gate_b.sh](/home/long/project/立交桥/projects/ai-customer-service/scripts/verify_preprod_gate_b.sh)
- Latest verification record: [PREPROD_VERIFICATION_RECORD.md](/home/long/project/立交桥/projects/ai-customer-service/docs/PREPROD_VERIFICATION_RECORD.md)
- Gate C rollback drill entry point: [scripts/verify_gate_c_rollback.sh](/home/long/project/立交桥/projects/ai-customer-service/scripts/verify_gate_c_rollback.sh)
- Latest rollback drill record: [ROLLBACK_DRILL_RECORD.md](/home/long/project/立交桥/projects/ai-customer-service/docs/ROLLBACK_DRILL_RECORD.md)

The script performs:
1. Environment variable completeness check
2. PostgreSQL connectivity check
3. Migration baseline check
4. Build from current source and service startup
5. `live` / `ready` probe checks
6. Signed webhook integration test
7. Dedup persistence verification
8. Ticket / audit persistence end-to-end verification

Recommended invocation:
```bash
AI_CS_RUNTIME_ENV=production \
AI_CS_ADDR=127.0.0.1:18080 \
AI_CS_POSTGRES_ENABLED=true \
AI_CS_POSTGRES_DSN='host=localhost port=5434 user=ai_cs password=ai_cs_secret dbname=ai_customer_service sslmode=disable' \
AI_CS_POSTGRES_MIGRATION_DIR="$PWD/db/migration" \
AI_CS_WEBHOOK_SECRET='replace-with-real-secret' \
AI_CS_WEBHOOK_TIMESTAMP_HEADER='X-CS-Timestamp' \
AI_CS_WEBHOOK_SIGNATURE_HEADER='X-CS-Signature' \
AI_CS_WEBHOOK_MAX_SKEW_SECONDS=300 \
scripts/verify_preprod_gate_b.sh
```
Pass criteria:
- The script exits with code `0`
- The output ends with `summary: pass=... fail=0`
- The artifact directory retains `summary.txt`, `service.log`, and `webhook_response.json`
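
The three pass criteria can also be checked mechanically. The following is a minimal sketch, not part of the repository: `check_gate_b_result` is a hypothetical helper, and it assumes that the last line of `summary.txt` is the same `summary: pass=... fail=...` line the script prints at the end of its output.

```bash
#!/usr/bin/env bash
# Sketch: evaluate the Gate B pass criteria after the script has run.
# Assumption (not confirmed by this runbook): summary.txt ends with the
# "summary: pass=N fail=M" line that the script prints.

check_gate_b_result() {
  local exit_code="$1" artifact_dir="$2" f

  # Criterion 1: the script exited with code 0.
  if [ "$exit_code" -ne 0 ]; then
    echo "FAIL: exit code $exit_code"
    return 1
  fi

  # Criterion 3: all three artifacts were retained.
  for f in summary.txt service.log webhook_response.json; do
    if [ ! -f "$artifact_dir/$f" ]; then
      echo "FAIL: missing artifact $f"
      return 1
    fi
  done

  # Criterion 2: the final summary line reports fail=0.
  if ! tail -n 1 "$artifact_dir/summary.txt" \
      | grep -Eq 'summary: pass=[0-9]+ fail=0([^0-9]|$)'; then
    echo "FAIL: summary does not report fail=0"
    return 1
  fi

  echo "PASS"
}
```

Call it as `check_gate_b_result "$?" <artifact-dir>` immediately after the script returns, passing whichever artifact directory the script reports.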
---
## 1. Pre-deployment checklist (Pre-flight)
```bash
# 1. Confirm the environment variables are complete
echo "AI_CS_RUNTIME_ENV=$AI_CS_RUNTIME_ENV"
echo "AI_CS_POSTGRES_ENABLED=$AI_CS_POSTGRES_ENABLED"
echo "AI_CS_POSTGRES_DSN=${AI_CS_POSTGRES_DSN:+[SET]}"
echo "AI_CS_WEBHOOK_SECRET=${AI_CS_WEBHOOK_SECRET:+[SET]}"
echo "AI_CS_LOG_LEVEL=$AI_CS_LOG_LEVEL"
# 2. Confirm PostgreSQL is reachable
PGPASSWORD=ai_cs_secret psql -h localhost -p 5434 -U ai_cs -d ai_customer_service -c "SELECT 1" || exit 1
# 3. Confirm the migration has been applied
PGPASSWORD=ai_cs_secret psql -h localhost -p 5434 -U ai_cs -d ai_customer_service -c "SELECT table_name FROM information_schema.tables WHERE table_schema='public' ORDER BY table_name;" | grep -q cs_sessions || { echo "MIGRATION MISSING"; exit 1; }
# 4. Start the service (in the background)
nohup ./ai-customer-service > /var/log/ai-cs.log 2>&1 &
sleep 3
# 5. Verify the ready probe
curl -s http://localhost:8080/actuator/health/ready | grep -q '"status":"UP"' || { echo "READY FAILED"; cat /var/log/ai-cs.log; exit 1; }
```
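
The `echo` lines in step 1 only print values for a human to inspect. If the checklist runs unattended, a fail-fast variant may be preferable. A minimal sketch follows; `require_env` is a hypothetical helper, not something shipped with the service. It reads the exported environment through `printenv`, which is what the service process will actually see.

```bash
#!/usr/bin/env bash
# Sketch: fail fast when a required variable is unset or empty.
# require_env is a hypothetical helper name, not part of the service.

require_env() {
  local name missing=0
  for name in "$@"; do
    # printenv only sees exported variables, matching what the
    # service process inherits at startup.
    if [ -z "$(printenv "$name")" ]; then
      echo "MISSING: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_env AI_CS_RUNTIME_ENV AI_CS_POSTGRES_ENABLED AI_CS_POSTGRES_DSN AI_CS_WEBHOOK_SECRET || exit 1` stops the checklist before any database or service step runs, and never prints a secret value.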
---
## 2. Troubleshooting startup failures
| Symptom | Cause | Resolution |
|------|------|----------|
| `memory fallback is not allowed` ERROR | Env=production but `AI_CS_POSTGRES_ENABLED` is not `true` | Set `AI_CS_POSTGRES_ENABLED=true` and restart |
| `AI_CS_POSTGRES_DSN is required` ERROR | Env=production but the DSN is not configured | Configure the full DSN, e.g. `postgres://user:pass@host:5434/db?sslmode=disable` |
| `listen tcp :8080: bind: address already in use` | Port 8080 is already in use | `pkill -f ai-customer-service`, or change to `AI_CS_ADDR=:8081` |
| `pq: connection refused` | PostgreSQL is unreachable | Check the PG host, port, and firewall; confirm `psql` can connect |
| `pq: password authentication failed` | Wrong password | Verify the password in `AI_CS_POSTGRES_DSN` |
| Startup succeeds but `/actuator/health/ready` returns `postgres:DOWN` | PG is reachable but the health check fails | Check that PG is responding on the port given in `AI_CS_POSTGRES_DSN` |
---
## 3. Troubleshooting migration failures
| Symptom | Cause | Resolution |
|------|------|----------|
| `pq: relation "cs_sessions" does not exist` | The migration has not been run | Run `psql -f db/migration/0001_init.up.sql` manually |
| `pq: duplicate key value violates unique constraint` | The tables already exist and the migration was re-run | The migration is idempotent (`CREATE TABLE IF NOT EXISTS`); safe to ignore |
| `pq: permission denied` | The PG user lacks permission to create tables | Confirm the `ai_cs` user is a superuser or owns the `ai_customer_service` database |
```bash
# Run the migration manually
psql "postgres://ai_cs:ai_cs_secret@localhost:5434/ai_customer_service?sslmode=disable" -f db/migration/0001_init.up.sql
```
---
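## 4. Behavior when the database is unavailable
- **Env=production**: at startup, `config.go` checks that `AI_CS_POSTGRES_ENABLED=true`; if the DSN is unreachable or authentication fails, the service **refuses to start** (it does not fall back to the memory store).
- **Env=test/development**: `AI_CS_POSTGRES_ENABLED=false` may be set to use the memory store (for testing only).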
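
The rules above can be mirrored in a pre-start check so that a misconfiguration is caught before the binary is launched. This is only a shell sketch of the decision as this runbook describes it, not a port of `config.go`. `store_mode` is a hypothetical name, and the non-production branch with PostgreSQL enabled is an assumption. It checks configuration only; reachability and authentication are still verified by the service itself.

```bash
#!/usr/bin/env bash
# Sketch: which store the service would use for a given configuration,
# following the rules in this runbook. Not the actual config.go logic.

store_mode() {
  local env="$1" enabled="$2" dsn="$3"

  if [ "$env" = "production" ]; then
    # Production never falls back to the memory store.
    if [ "$enabled" != "true" ]; then
      echo "refuse: memory fallback is not allowed"
      return 1
    fi
    if [ -z "$dsn" ]; then
      echo "refuse: AI_CS_POSTGRES_DSN is required"
      return 1
    fi
    echo "postgres"
    return 0
  fi

  # test/development: the memory store is allowed when PostgreSQL is off.
  # Assumption: with PostgreSQL enabled, these environments use it too.
  if [ "$enabled" = "true" ]; then
    echo "postgres"
  else
    echo "memory"
  fi
}
```

Usage: `store_mode "$AI_CS_RUNTIME_ENV" "$AI_CS_POSTGRES_ENABLED" "$AI_CS_POSTGRES_DSN" || exit 1`.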
---
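## 5. Troubleshooting webhook signature authentication failures
| Symptom | Cause | Resolution |
|------|------|----------|
| `CS_AUTH_4034 invalid webhook signature` | HMAC secret mismatch | Confirm the upstream uses the same secret as `AI_CS_WEBHOOK_SECRET` |
| `CS_AUTH_4031 missing webhook signature` | The upstream did not send the `X-CS-Signature` header | Check the upstream webhook sending logic |
| `CS_AUTH_4033 stale webhook request` | Request timestamp exceeds MaxSkew (default 300s) | Confirm server time is synchronized (NTP), or adjust `AI_CS_WEBHOOK_MAX_SKEW_SECONDS` |
| Intermittent 403 | Clock drift exceeds 300s | Check the server time zone and NTP configuration |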
```bash
# Verify the signature algorithm (local test)
TS=$(date +%s)
BODY='{"test":"payload"}'
SIG=$(echo -n "${TS}.${BODY}" | openssl dgst -sha256 -hmac "test-secret-123" | awk '{print $2}')
curl -v -X POST http://localhost:8080/api/v1/customer-service/webhook \
-H "Content-Type: application/json" \
-H "X-CS-Timestamp: $TS" \
-H "X-CS-Signature: $SIG" \
-d "$BODY"
```
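
When diagnosing `CS_AUTH_4033`, it can help to reproduce the staleness rule locally against a captured timestamp. The sketch below assumes the check is symmetric (a timestamp too far in the future is rejected as well) and that a skew of exactly MaxSkew is still accepted. Neither detail is confirmed by this runbook, and `is_stale` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: reproduce the stale-request rule for a captured timestamp.
# Assumptions: the skew is compared as an absolute difference, and a
# request is stale only when the difference is strictly greater.

is_stale() {
  local request_ts="$1" now="$2" max_skew="$3"
  local diff=$(( now - request_ts ))

  # Take the absolute value so future-dated requests count too.
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi

  # Exit status 0 means stale, 1 means within the allowed skew.
  [ "$diff" -gt "$max_skew" ]
}
```

For example, `is_stale "$TS" "$(date +%s)" "${AI_CS_WEBHOOK_MAX_SKEW_SECONDS:-300}" && echo "would be rejected as stale"`.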
---
## 6. Rollback procedures
### 6.1 Version rollback (from v1.1.0 to v1.0.0)
```bash
# 1. Record the current version
echo "Rolling back from $(./ai-customer-service --version) to v1.0.0"
# 2. Stop the current service
pkill -f "ai-customer-service"
sleep 2
# 3. Back up the current database (optional, but recommended before proceeding)
PGPASSWORD=ai_cs_secret pg_dump -h localhost -p 5434 -U ai_cs ai_customer_service > /tmp/ai_cs_backup_$(date +%Y%m%d_%H%M%S).sql
# 4. Fetch the old image / binary
# Docker: docker pull ai-customer-service:v1.0.0
# Binary: retrieve the v1.0.0 binary from the backup location
# 5. Restart the service
nohup ./ai-customer-service-v1.0.0 > /var/log/ai-cs-v1.0.0.log 2>&1 &
sleep 3
# 6. Verify
curl -s http://localhost:8080/actuator/health/ready
curl -s http://localhost:8080/actuator/health
```
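
The fixed `sleep 3` before verification can be too short on a slow host and needlessly long on a fast one. One alternative is to poll the ready probe with a timeout. A minimal sketch: `wait_until` and `ready_up` are hypothetical helpers, and `ready_up` reuses the same `"status":"UP"` check as the pre-flight checklist.

```bash
#!/usr/bin/env bash
# Sketch: poll a command once per second until it succeeds or the
# timeout (in seconds) is reached. Helper names are hypothetical.

wait_until() {
  local timeout="$1" waited=0
  shift
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
}

# Same readiness check as the pre-flight step: the probe body must
# report "status":"UP".
ready_up() {
  curl -s http://localhost:8080/actuator/health/ready | grep -q '"status":"UP"'
}
```

After restarting the service, `wait_until 30 ready_up || { echo "READY FAILED"; exit 1; }` would replace the fixed sleep and the manual `curl` inspection.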
### 6.2 Configuration rollback
```bash
# If the new configuration is faulty, restore the environment variables
export AI_CS_POSTGRES_ENABLED=true
export AI_CS_POSTGRES_DSN="postgres://ai_cs:ai_cs_secret@localhost:5434/ai_customer_service?sslmode=disable"
export AI_CS_WEBHOOK_SECRET="previous-secret"
pkill -f "ai-customer-service"
sleep 2
nohup ./ai-customer-service > /var/log/ai-cs.log 2>&1 &
```
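### 6.3 Database rollback (migrations do not support down-migration; handle manually)
```sql
-- Emergency: wipe all data and rebuild (development only)
TRUNCATE cs_audit_logs, cs_tickets, cs_messages, cs_sessions, cs_message_dedup CASCADE;
-- Then restart the service so the migration re-initializes
```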
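
Because the `TRUNCATE` above is destructive, it may be worth wrapping it in a guard that refuses to run outside development. A minimal sketch: `guard_development_only` is a hypothetical helper, and it assumes the operator's shell has `AI_CS_RUNTIME_ENV` set to the same value as the target service.

```bash
#!/usr/bin/env bash
# Sketch: refuse destructive actions unless the runtime environment is
# development. Assumes AI_CS_RUNTIME_ENV in this shell matches the target.

guard_development_only() {
  if [ "${AI_CS_RUNTIME_ENV:-}" != "development" ]; then
    echo "refusing: AI_CS_RUNTIME_ENV='${AI_CS_RUNTIME_ENV:-}' is not development" >&2
    return 1
  fi
}
```

For example, chain it in front of the wipe: `guard_development_only && psql "$AI_CS_POSTGRES_DSN" -c "TRUNCATE cs_audit_logs, cs_tickets, cs_messages, cs_sessions, cs_message_dedup CASCADE;"`.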
---
## 7. Quick health diagnostics
```bash
#!/bin/bash
# 60-second quick diagnostic script
echo "=== AI-CS Health Diagnostic ==="
echo ""
echo "[1/5] Service process:"
ps aux | grep "ai-customer-service" | grep -v grep || echo " NOT RUNNING ❌"
echo ""
echo "[2/5] HTTP endpoints:"
for endpoint in "/actuator/health/live" "/actuator/health/ready" "/actuator/health"; do
status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080$endpoint)
echo " $endpoint → HTTP $status $([ "$status" = "200" ] && echo '✅' || echo '❌')"
done
echo ""
echo "[3/5] PostgreSQL:"
PGPASSWORD=ai_cs_secret psql -h localhost -p 5434 -U ai_cs -d ai_customer_service -c "SELECT count(*) as tickets FROM cs_tickets;" 2>&1 | grep -v "^Password" | tail -1
echo ""
echo "[4/5] Recent errors in log:"
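# A pipeline's exit status is that of its last command (tail), so test the captured output instead.
recent_errors=$(tail -50 /var/log/ai-cs.log 2>/dev/null | grep "ERROR" | tail -5)
[ -n "$recent_errors" ] && echo "$recent_errors" || echo " No recent errors ✅"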
echo ""
echo "[5/5] Webhook test:"
TS=$(date +%s)
BODY='{"channel":"widget","message_id":"diag-001","open_id":"diag-open","content":"health check","timestamp":"2026-05-04T00:00:00Z"}'
SIG=$(echo -n "${TS}.${BODY}" | openssl dgst -sha256 -hmac "test-secret-123" | awk '{print $2}')
curl -s -X POST http://localhost:8080/api/v1/customer-service/webhook \
-H "Content-Type: application/json" \
-H "X-CS-Timestamp: $TS" \
-H "X-CS-Signature: $SIG" \
-d "$BODY" | head -c 200
echo ""
echo "=== Diagnostic complete ==="
```