ai-customer-service/docs/RUNBOOK.md
Your Name c7cb174c58 docs: deliver DO-P1-1 monitoring + DO-P1-2 runbook
DO-P1-1: MONITORING_ALERTING.md
- 8 monitoring coverage items (5xx/reject/handoff/ticket/audit/DB/ready/live)
- K8s liveness/readiness probe config
- Prometheus metrics exposure spec
- Alert thresholds (Prometheus AlertManager YAML)
- Self-healing strategy table

DO-P1-2: RUNBOOK.md
- Pre-flight deployment checklist
- Startup failure troubleshooting (6 scenarios)
- Migration failure troubleshooting
- DB unavailable behavior (production fail-fast)
- Webhook auth debugging guide
- Full rollback procedure (v1.1.0 → v1.0.0)
- 60s health diagnostic script

Gate B now: 6/6 complete 
2026-05-04 09:41:56 +08:00


DO-P1-2: Operations and Rollback Runbook

Status: Delivered. Owner: DevOps (filled in on their behalf by 宰相). Baseline: P0 complete, Gate B pre-production verification. Date: 2026-05-04.


1. Pre-flight deployment checklist

# 1. Confirm that the environment variables are complete
echo "AI_CS_ENV=$AI_CS_ENV"
echo "AI_CS_POSTGRES_ENABLED=$AI_CS_POSTGRES_ENABLED"
echo "AI_CS_POSTGRES_DSN=${AI_CS_POSTGRES_DSN:+[SET]}"
echo "AI_CS_WEBHOOK_SECRET=${AI_CS_WEBHOOK_SECRET:+[SET]}"
echo "AI_CS_LOG_LEVEL=$AI_CS_LOG_LEVEL"

# 2. Confirm that PostgreSQL is reachable
PGPASSWORD=ai_cs_secret psql -h localhost -p 5434 -U ai_cs -d ai_customer_service -c "SELECT 1" || exit 1

# 3. Confirm that the migration has been applied
PGPASSWORD=ai_cs_secret psql -h localhost -p 5434 -U ai_cs -d ai_customer_service -c "SELECT table_name FROM information_schema.tables WHERE table_schema='public' ORDER BY table_name;" | grep -q cs_sessions || { echo "MIGRATION MISSING"; exit 1; }

# 4. Start the service (in the background)
nohup ./ai-customer-service > /var/log/ai-cs.log 2>&1 &
sleep 3

# 5. Verify the ready probe
curl -s http://localhost:8080/ready | grep -q '"status":"UP"' || { echo "READY FAILED"; cat /var/log/ai-cs.log; exit 1; }
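
The fixed `sleep 3` in step 4 can be too short when the service starts slowly, in which case step 5 fails even though the service would have become ready a moment later. One possible alternative is to poll the ready probe with a bounded number of attempts. The sketch below assumes nothing about the service beyond the `/ready` response used in step 5; `retry_until` and `ready_probe` are hypothetical helper names, not part of the project.

```shell
#!/bin/bash
# Hypothetical helper (not part of the service): run a probe command until it
# succeeds, at most max_attempts times, sleeping delay_seconds between tries.
retry_until() {
  local max_attempts delay_seconds attempt
  max_attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Probe equivalent to step 5: /ready must report "status":"UP".
ready_probe() {
  curl -s http://localhost:8080/ready | grep -q '"status":"UP"'
}

# Usage (commented out so that sourcing this file has no side effects):
# retry_until 30 1 ready_probe || { echo "READY FAILED"; cat /var/log/ai-cs.log; exit 1; }
```

With 30 attempts and a 1-second delay, the check gives up after roughly half a minute instead of depending on a single fixed wait.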

2. Startup failure troubleshooting

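| Symptom | Cause | Solution |
| --- | --- | --- |
| memory fallback is not allowed (ERROR) | Env=production but AI_CS_POSTGRES_ENABLED≠true | Set AI_CS_POSTGRES_ENABLED=true and restart |
| AI_CS_POSTGRES_DSN is required (ERROR) | Env=production but the DSN is not configured | Configure a complete DSN: postgres://user:pass@host:5434/db?sslmode=disable |
| listen tcp :8080: bind: address already in use | Port 8080 is already in use | pkill -f ai-customer-service, or change the address with AI_CS_ADDR=:8081 |
| pq: connection refused | PostgreSQL is unreachable | Check the PG host, port, and firewall; confirm that psql can connect |
| pq: password authentication failed | Wrong password | Verify the password in AI_CS_POSTGRES_DSN |
| Startup succeeds but /ready returns postgres:DOWN | PG is reachable but the health check fails | Check that PG is responding on the port specified in AI_CS_POSTGRES_DSN |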

3. Migration failure troubleshooting

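| Symptom | Cause | Solution |
| --- | --- | --- |
| pq: relation "cs_sessions" does not exist | The migration has not been run | Run it manually: psql -f db/migration/0001_init.up.sql |
| pq: duplicate key value violates unique constraint | The tables already exist and the migration was re-run | The migration is idempotent (CREATE TABLE IF NOT EXISTS); safe to ignore |
| pq: permission denied | The PG user lacks permission to create tables | Confirm that the ai_cs user is a superuser or owns ai_customer_service |
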
# Run the migration manually
psql "postgres://ai_cs:ai_cs_secret@localhost:5434/ai_customer_service?sslmode=disable" -f db/migration/0001_init.up.sql

4. Behavior when the database is unavailable

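  • Env=production: at startup, config.go checks that AI_CS_POSTGRES_ENABLED=true. If the DSN is unreachable or authentication fails, the service refuses to start (it does not fall back to the memory store).
  • Env=test/development: AI_CS_POSTGRES_ENABLED=false may be set to use the memory store (for testing only).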
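
The fail-fast rules above can be approximated in a pre-start shell check, so that a misconfiguration is caught before the binary is even launched. This is only a sketch: the real check lives in config.go and may differ in detail, `validate_store_config` is a hypothetical name, and the sketch assumes that AI_CS_ENV carries the literal value `production`. It covers the configuration rules only, not DSN reachability or authentication.

```shell
#!/bin/bash
# Hypothetical pre-start check mirroring the rules described above.
# The error messages reuse the strings from the startup troubleshooting table.
validate_store_config() {
  # Outside production the memory store is permitted, so nothing is enforced.
  if [ "${AI_CS_ENV:-}" != "production" ]; then
    return 0
  fi
  # Production must use PostgreSQL: no fallback to the memory store.
  if [ "${AI_CS_POSTGRES_ENABLED:-}" != "true" ]; then
    echo "memory fallback is not allowed" >&2
    return 1
  fi
  # Production must also have a DSN configured.
  if [ -z "${AI_CS_POSTGRES_DSN:-}" ]; then
    echo "AI_CS_POSTGRES_DSN is required" >&2
    return 1
  fi
  return 0
}

# Usage (commented out so that sourcing this file has no side effects):
# validate_store_config || exit 1
```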

5. Troubleshooting webhook signature authentication failures during integration

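| Symptom | Cause | Solution |
| --- | --- | --- |
| CS_AUTH_4034 invalid webhook signature | The HMAC secret does not match | Confirm that the upstream uses the same secret as AI_CS_WEBHOOK_SECRET |
| CS_AUTH_4031 missing webhook signature | The upstream did not send the X-CS-Signature header | Check the upstream webhook sending logic |
| CS_AUTH_4033 stale webhook request | The request timestamp exceeds MaxSkew (default 300s) | Confirm that server time is synchronized (NTP), or adjust AI_CS_WEBHOOK_MAX_SKEW_SECONDS |
| Intermittent 403 | Clock drift exceeds 300s | Check the server time zone and NTP configuration |
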
# Verify the signature algorithm (local test)
TS=$(date +%s)
BODY='{"test":"payload"}'
SIG=$(echo -n "${TS}.${BODY}" | openssl dgst -sha256 -hmac "test-secret-123" | awk '{print $2}')
curl -v -X POST http://localhost:8080/api/v1/customer-service/webhook \
  -H "Content-Type: application/json" \
  -H "X-CS-Timestamp: $TS" \
  -H "X-CS-Signature: $SIG" \
  -d "$BODY"
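
When debugging CS_AUTH_4033, it can help to check locally whether a given timestamp would fall outside the allowed skew before sending the request. The sketch below is an illustration only: `is_stale` is a hypothetical helper, the 300-second default comes from the table above, and treating future-dated timestamps the same as old ones (absolute difference) is an assumption about the service, not confirmed behavior.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) when request_ts differs from now by
# more than max_skew seconds, otherwise fail (exit 1).
is_stale() {
  local request_ts now max_skew diff
  request_ts=$1
  now=$2
  max_skew=${3:-300}   # default from the troubleshooting table (300s)
  diff=$((now - request_ts))
  # Assumption: skew is measured in both directions, so take the absolute value.
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  [ "$diff" -gt "$max_skew" ]
}

# Usage (commented out so that sourcing this file has no side effects):
# if is_stale "$TS" "$(date +%s)"; then echo "timestamp is outside the allowed skew"; fi
```

A timestamp that is exactly 300 seconds old is treated as still valid here, matching the "> MaxSkew" wording in the table.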

6. Rollback procedure

6.1 Version rollback (from v1.1.0 to v1.0.0)

# 1. Record the current version
echo "Rolling back from $(./ai-customer-service --version) to v1.0.0"

# 2. Stop the current service
pkill -f "ai-customer-service"
sleep 2

# 3. Back up the current database (optional, but backing up first is recommended)
PGPASSWORD=ai_cs_secret pg_dump -h localhost -p 5434 -U ai_cs ai_customer_service > /tmp/ai_cs_backup_$(date +%Y%m%d_%H%M%S).sql

# 4. Fetch the old version's image / binary
# Docker: docker pull ai-customer-service:v1.0.0
# Binary: retrieve the v1.0.0 binary from the backup location

# 5. Restart the service
nohup ./ai-customer-service-v1.0.0 > /var/log/ai-cs-v1.0.0.log 2>&1 &
sleep 3

# 6. Verify
curl -s http://localhost:8080/ready
curl -s http://localhost:8080/actuator/health

6.2 Configuration rollback

# If the new configuration is faulty, restore the previous environment variables
export AI_CS_POSTGRES_ENABLED=true
export AI_CS_POSTGRES_DSN="postgres://ai_cs:ai_cs_secret@localhost:5434/ai_customer_service?sslmode=disable"
export AI_CS_WEBHOOK_SECRET="previous-secret"
pkill -f "ai-customer-service"
sleep 2
nohup ./ai-customer-service > /var/log/ai-cs.log 2>&1 &

6.3 Database rollback (migrations do not support down-migration; handle manually)

-- Emergency: wipe all data and rebuild (development only)
TRUNCATE cs_audit_logs, cs_tickets, cs_messages, cs_sessions, cs_message_dedup CASCADE;
-- Then restart the service so that the migration re-initializes
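
Because the TRUNCATE above is meant for development only, a wrapper script might refuse to run it anywhere else. The following is a minimal sketch of such a guard, assuming that AI_CS_ENV carries the literal value `development` in development environments; `require_development` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical guard: permit destructive operations only when AI_CS_ENV is
# exactly "development". Any other value, including unset, is refused.
require_development() {
  if [ "${AI_CS_ENV:-}" != "development" ]; then
    echo "refusing destructive operation: AI_CS_ENV='${AI_CS_ENV:-}' is not development" >&2
    return 1
  fi
  return 0
}

# Usage (commented out so that sourcing this file has no side effects):
# require_development && psql "$AI_CS_POSTGRES_DSN" \
#   -c "TRUNCATE cs_audit_logs, cs_tickets, cs_messages, cs_sessions, cs_message_dedup CASCADE;"
```

Refusing when the variable is unset is deliberate: a missing environment marker is treated as "not development" rather than as permission.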

7. Quick health diagnostic

#!/bin/bash
# 60-second quick diagnostic script

echo "=== AI-CS Health Diagnostic ==="
echo ""

echo "[1/5] Service process:"
ps aux | grep "ai-customer-service" | grep -v grep || echo "  NOT RUNNING ❌"

echo ""
echo "[2/5] HTTP endpoints:"
for endpoint in "/live" "/ready" "/actuator/health"; do
  status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080$endpoint)
  echo "  $endpoint → HTTP $status $([ "$status" = "200" ] && echo '✅' || echo '❌')"
done

echo ""
echo "[3/5] PostgreSQL:"
PGPASSWORD=ai_cs_secret psql -h localhost -p 5434 -U ai_cs -d ai_customer_service -c "SELECT count(*) as tickets FROM cs_tickets;" 2>&1 | grep -v "^Password" | tail -1

echo ""
echo "[4/5] Recent errors in log:"
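# Capture the matches first: a pipeline's exit status is that of its last command (tail), which succeeds even with no matches.
errors=$(tail -50 /var/log/ai-cs.log 2>/dev/null | grep "ERROR" | tail -5)
if [ -n "$errors" ]; then echo "$errors"; else echo "  No recent errors ✅"; fi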

echo ""
echo "[5/5] Webhook test:"
TS=$(date +%s)
BODY='{"channel":"widget","message_id":"diag-001","open_id":"diag-open","content":"health check","timestamp":"2026-05-04T00:00:00Z"}'
SIG=$(echo -n "${TS}.${BODY}" | openssl dgst -sha256 -hmac "test-secret-123" | awk '{print $2}')
curl -s -X POST http://localhost:8080/api/v1/customer-service/webhook \
  -H "Content-Type: application/json" \
  -H "X-CS-Timestamp: $TS" \
  -H "X-CS-Signature: $SIG" \
  -d "$BODY" | head -c 200

echo ""
echo "=== Diagnostic complete ==="
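
The script above prints its findings but always exits with status 0, because its last command is an `echo`. If the diagnostic is ever run from automation, it may be useful to track failures and return a non-zero status. The sketch below shows one way to do that; `record_check`, `diagnostic_status`, and `FAILED_CHECKS` are hypothetical names introduced for illustration.

```shell
#!/bin/bash
# Hypothetical helpers for turning the diagnostic into a pass/fail check.
FAILED_CHECKS=0

# Record one result: $1 is a label, $2 is the exit status of the check.
record_check() {
  if [ "$2" -eq 0 ]; then
    echo "  $1: OK"
  else
    echo "  $1: FAILED"
    FAILED_CHECKS=$((FAILED_CHECKS + 1))
  fi
}

# Succeed only when no recorded check has failed.
diagnostic_status() {
  [ "$FAILED_CHECKS" -eq 0 ]
}

# Usage (commented out so that sourcing this file has no side effects):
# curl -sf -o /dev/null http://localhost:8080/ready; record_check "/ready" $?
# diagnostic_status || exit 1
```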