docs: enrich environment issues analysis and correct repair plan status

- Expand TEST_ENVIRONMENT_ISSUES.md with detailed root cause analysis,
  resolution paths, and diagnostic commands for all 5 environment issues
- Add docs/experts/00_PROJECT_OVERVIEW.md with full project landscape
  (3 services, key files, security posture, test state, constraints)
- Correct SYSTEMATIC_REPAIR_PLAN: P0-1 and P0-2 are actually fixed
  via validateStartupSecurity() in bootstrap.go (not residual issues)
- All P0/P1 fixes verified against source code
Your Name
2026-04-18 09:34:21 +08:00
parent 0d81a53b7a
commit 8fcdfe400e
3 changed files with 313 additions and 58 deletions


@@ -1,82 +1,148 @@
# Test Environment Issues
This document describes pre-existing test failures that are **environment-related**, not code bugs. They cannot be fixed by code changes alone.
## Issue 1: `TestTokenStoreIntegration` — module not found
**Symptom:**
```
module lijiaoqiao/platform-token-runtime is not in GOROOT (/usr/lib/go-1.22/src/lijiaoqiao/platform-token-runtime)
```
**Root Cause:** The Go module path is `lijiaoqiao/platform-token-runtime` but the GOPATH is not configured for this module path structure. Go 1.22 requires either:
- The module to be in `$GOPATH/src/lijiaoqiao/platform-token-runtime`, OR
- `go.mod` to be properly resolvable via `GOPATH/pkg/mod`
The system's GOPATH (`/usr/lib/go-1.22`) does not contain the `lijiaoqiao/` prefix path.
**Fix:** Set `GOPATH` to include the correct path structure, or use `go work` to resolve the module.
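> **Note**: All of the following are **environment configuration issues**, not code defects. They are resolved through operations/infrastructure configuration; no code changes are required.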
---
## Issue 2: `TestAuditLogExporter` — etcd client connection
## Issue 1: `TestTokenStoreIntegration` — Go module not found in GOROOT
**Symptom:**
**Test**: integration test inside `platform-token-runtime`
**Symptom**:
```
module lijiaoqiao/platform-token-runtime is not in GOROOT
(/usr/lib/go-1.22/src/lijiaoqiao/platform-token-runtime)
```
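**Root cause analysis**:
When resolving a module, the Go toolchain searches in the following order:
1. It first uses the `module path` declared in `go.mod` (correctly defined as `lijiaoqiao/platform-token-runtime`)
2. In `GOPATH` mode, Go tries to treat the module path as a filesystem path and looks for it under `$GOPATH/src/`
The directory named in the error, `/usr/lib/go-1.22`, is the Go 1.22 installation root (the message labels it `GOROOT`). It contains no `lijiaoqiao/platform-token-runtime` subdirectory, so `go test ./...` reports not found whenever the module path is resolved as a filesystem path.
`go build ./...` (module-aware mode) is unaffected, because the module path does not depend on the GOPATH directory structure.
**Resolution paths** (choose one):
1. **Recommended**: use `go work` to create a work file in the repository root and mount the three modules into a single workspace, removing the GOPATH dependency
2. Short-term: when running `go test ./...`, explicitly add `GOFLAGS=-mod=mod`; module-aware mode itself is selected by `GO111MODULE` (on by default in Go 1.22), so also confirm it has not been set to `off`
3. In CI, set `GOPATH` so that it contains the correct path structure (e.g. `/home/long/go`) and place the code under `$GOPATH/src/lijiaoqiao/`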
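To tell which resolution mode a given shell or CI job is actually in, the `GOMOD` value reported by `go env` is a useful signal: it is empty in GOPATH mode and points at the active `go.mod` in module-aware mode. The following is a minimal sketch that interprets the JSON printed by `go env -json GOMOD`; the helper name `inModuleMode` and the sample path are illustrative assumptions, not part of the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// goEnv holds the single `go env` value this check needs.
type goEnv struct {
	GOMOD string `json:"GOMOD"`
}

// inModuleMode reports whether the JSON output of `go env -json GOMOD`
// shows that a go.mod file is in effect. GOMOD is empty in GOPATH mode
// and equals os.DevNull in module-aware mode when no go.mod was found.
func inModuleMode(envJSON []byte) (bool, error) {
	var env goEnv
	if err := json.Unmarshal(envJSON, &env); err != nil {
		return false, fmt.Errorf("parse go env output: %w", err)
	}
	return env.GOMOD != "" && env.GOMOD != os.DevNull, nil
}

func main() {
	// Hypothetical sample; in practice feed in the output of `go env -json GOMOD`.
	sample := []byte(`{"GOMOD":"/home/long/project/platform-token-runtime/go.mod"}`)
	ok, err := inModuleMode(sample)
	fmt.Println(ok, err) // prints "true <nil>"
}
```

If this check returns false inside the module directory, the environment is overriding module resolution and the resolution paths above apply.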
---
## Issue 2: `TestAuditLogExporter` — etcd broker connection refused
**Test**: an exporter test inside `platform-token-runtime`
**Symptom**:
```
dial tcp 127.0.0.1:2379: connect: connection refused
```
**Root Cause:** The test requires a running etcd instance on `127.0.0.1:2379`. The etcd binary is not running in the test environment. This is an infrastructure dependency, not a code defect.
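**Root cause analysis**:
The test code tries to connect to a local etcd instance (default port 2379) as the audit log backend. No etcd process is running in the test environment. This is **missing infrastructure**, not a code problem.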
**Fix:** Start an etcd server (`etcd`) on the default port before running this test.
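**Resolution paths**:
1. Local development: start an etcd container with Docker: `docker run -p 2379:2379 quay.io/coreos/etcd`
2. CI: use `docker-compose` to start the etcd service before the test job
3. Isolated testing: to run only the unit logic, use a build tag to skip the integration test cases that need etcd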
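A build tag excludes the test at compile time. A complementary option, shown here as a sketch rather than what the repository does, is to probe the port at run time and skip when nothing is listening. The helper name `reachable` and the 500 ms timeout are assumptions.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// reachable reports whether a TCP connection to addr succeeds within timeout.
func reachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Inside a test the same check would read:
	//   if !reachable("127.0.0.1:2379", 500*time.Millisecond) { t.Skip("etcd not running") }
	if !reachable("127.0.0.1:2379", 500*time.Millisecond) {
		fmt.Println("etcd not reachable on 127.0.0.1:2379; the integration test would be skipped")
		return
	}
	fmt.Println("etcd reachable; the integration test can run")
}
```

The trade-off is that a skipped test still appears in the test output, which makes the missing dependency visible instead of silently dropping the case.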
---
## Issue 3: `TestIntegrationPipeline` — Kafka consumer timeout
**Symptom:**
**Test**: end-to-end integration test inside `supply-api`
**Symptom**:
```
kafka server: waited 5s for messages: context deadline exceeded
```
**Root Cause:** The integration test requires a running Kafka broker on `localhost:9092`. The Kafka broker is not running in the test environment. The test waits for messages on a Kafka topic but the broker is absent, causing a context deadline exceeded error.
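**Root cause analysis**:
The test sends a message to a Kafka topic and waits for the consumer to process it. No Kafka broker (default port 9092) is running in the test environment, so the consumer receives nothing within the timeout and the test fails with context deadline exceeded.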
**Fix:** Start a Kafka broker (e.g., via Docker: `docker run -p 9092:9092 apache/kafka`) before running this test.
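**Resolution paths**:
1. Local development: start a Kafka container with Docker (or use the `strimzi` Kafka image)
2. CI: run `docker-compose up -d kafka` before the test job
3. Alternative: use the mock producer / test broker facilities of `github.com/IBM/sarama` to run unit tests without a broker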
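The failure mode itself can be reproduced without Kafka. The sketch below uses a plain Go channel as a stand-in for the topic (an assumption made purely for illustration; the function names are hypothetical) and shows that a consumer with no producer ends in `context.DeadlineExceeded`, which is what the symptom reports.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForMessage blocks until a message arrives on msgs or ctx expires.
func waitForMessage(ctx context.Context, msgs <-chan string) (string, error) {
	select {
	case m := <-msgs:
		return m, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// consumeWithin waits at most millis milliseconds for a single message.
func consumeWithin(msgs <-chan string, millis int) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(millis)*time.Millisecond)
	defer cancel()
	return waitForMessage(ctx, msgs)
}

func main() {
	// Nothing ever writes to this channel, mirroring a missing broker.
	empty := make(chan string)
	_, err := consumeWithin(empty, 50)
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // prints "true"

	// With a message already available the same call returns at once.
	ready := make(chan string, 1)
	ready <- "order-created"
	msg, err := consumeWithin(ready, 50)
	fmt.Println(msg, err) // prints "order-created <nil>"
}
```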
---
## Issue 4: `TestCloudWatchLogsExporter` — AWS credentials not configured
## Issue 4: `TestCloudWatchLogsExporter` — No AWS credentials
**Symptom:**
**Test**: CloudWatch exporter tests inside `supply-api`
**Symptom**:
```
NoCredentialProviders: no valid providers in chain. Env [AuthEnv]
```
**Root Cause:** The test exercises the CloudWatch Logs exporter which uses the AWS SDK. It finds no AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, nor an AWS profile). This is an infrastructure/setup issue, not a code defect.
**Root cause analysis**:
AWS SDK Go v2 looks up credentials in the following order:
1. Environment variables `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY`
2. The `~/.aws/credentials` file
3. ECS/IAM Role (when running in the cloud)
4. Lambda Role
**Fix:** Set valid AWS credentials via environment variables or `~/.aws/credentials` before running this test.
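The test environment has none of the four, so the SDK returns a `NoCredentialProviders` error.
**Resolution paths**:
1. Inject fake access keys through test environment variables: `AWS_ACCESS_KEY_ID=fake AWS_SECRET_ACCESS_KEY=fake`
2. Use an AWS SDK mock: the `credentials` package of `aws-sdk-go-v2` can inject a static provider
3. Isolation: use a build tag or a `go:generate`-generated mock to replace the real CloudWatch client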
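The lookup order described above is a first-match chain. The following is a simplified model of that idea, not the SDK's implementation: the types `creds` and `provider` and the function `resolve` are hypothetical, and only the environment-variable step is modeled. It shows why injecting fake values for the two variables is enough to get past the error.

```go
package main

import (
	"errors"
	"fmt"
)

// creds is a simplified credential pair (illustrative, not an SDK type).
type creds struct {
	AccessKeyID, SecretAccessKey string
}

// provider returns credentials, or false when it has none to offer.
type provider func() (creds, bool)

var errNoProviders = errors.New("no valid providers in chain")

// resolve walks the providers in order and returns the first hit.
func resolve(chain ...provider) (creds, error) {
	for _, p := range chain {
		if c, ok := p(); ok {
			return c, nil
		}
	}
	return creds{}, errNoProviders
}

// envProvider reads the two standard variables through getenv,
// which lets a test inject fake values without touching the process env.
func envProvider(getenv func(string) string) provider {
	return func() (creds, bool) {
		id, secret := getenv("AWS_ACCESS_KEY_ID"), getenv("AWS_SECRET_ACCESS_KEY")
		return creds{id, secret}, id != "" && secret != ""
	}
}

func main() {
	empty := func(string) string { return "" }
	_, err := resolve(envProvider(empty))
	fmt.Println(err) // prints "no valid providers in chain"

	fake := func(string) string { return "fake" }
	c, err := resolve(envProvider(empty), envProvider(fake))
	fmt.Println(c.AccessKeyID, err) // prints "fake <nil>"
}
```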
---
## Issue 5: Go type hints not available in Python stubs (lint warning)
## Issue 5: Python type hints lint — `typing.TypeAlias` not available
**Symptom (not a test failure, but a quality warning):**
**Symptom**:
```
python -m py_compile: AttributeError: module 'typing' has no attribute 'TypeAlias'
AttributeError: module 'typing' has no attribute 'TypeAlias'
```
**Root Cause:** Python type hint `TypeAlias` was added in Python 3.10. The system has Python 3.8 or earlier. This is a Python version mismatch — code uses modern type hint syntax incompatible with the installed Python runtime.
**Root cause analysis**:
`typing.TypeAlias` is the explicit type-alias annotation introduced in Python 3.10 (PEP 613), used in type annotations:
```python
from typing import TypeAlias
MyAlias: TypeAlias = list[int] # 3.10+
```
**Fix:** Upgrade the system Python to 3.10+.
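The system Python is 3.8, which does not have this attribute. The code itself has no bug; the linter simply fails on the older Python version.
**Resolution paths**:
1. Upgrade the system Python to 3.10+ (e.g. `pyenv install 3.10`)
2. Or run the CI linter step in a Docker container pinned to a Python 3.10 image
3. If the linter configuration is under your control, add a conditional fallback such as `typing.TypeAlias = str` for Python versions below 3.10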
---
## Summary
## Summary table
| # | Test/Issue | Type | Root Cause | Fix Required |
|---|-----------|------|-----------|-------------|
| 1 | `TestTokenStoreIntegration` | Module/GOPATH | Go module path not in GOROOT/GOPATH | Configure `GOPATH` correctly |
| 2 | `TestAuditLogExporter` | Missing etcd | No etcd broker running | Start etcd on port 2379 |
| 3 | `TestIntegrationPipeline` | Missing Kafka | No Kafka broker running | Start Kafka on port 9092 |
| 4 | `TestCloudWatchLogsExporter` | Missing AWS creds | No AWS credentials configured | Set AWS credentials env vars |
| 5 | Python type hints lint | Python version | Python < 3.10 | Upgrade Python to 3.10+ |
| # | Test | Type | Root cause | Resolution |
|---|--------|------|------|---------|
| 1 | `TestTokenStoreIntegration` | GOPATH / module path | `lijiaoqiao/<module>` is not in the system GOPATH | `go work` or `GOFLAGS=-mod=mod` |
| 2 | `TestAuditLogExporter` | Missing etcd | etcd is not running | Start an etcd container |
| 3 | `TestIntegrationPipeline` | Missing Kafka | Kafka broker is not running | Start a Kafka container |
| 4 | `TestCloudWatchLogsExporter` | Missing AWS credentials | No AWS credentials in the environment | Inject fake keys or mock the SDK |
| 5 | Python type checking | Python version | System Python < 3.10 | Upgrade Python or pin the version with Docker |
---
## Quick diagnostic commands
```bash
# 1. Check Go module mode
go env GOFLAGS GOMOD
# 2. Verify whether etcd is running
curl -s http://127.0.0.1:2379/health
# 3. Verify whether Kafka is running
ss -tlnp | grep 9092
# 4. Check AWS credentials
aws sts get-caller-identity 2>&1 || echo "No credentials"
# 5. Python version
python3 --version
```


@@ -0,0 +1,198 @@
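# 立交桥 Project Expert Review: Project Overview
## 1. Basic project information
- **Project name**: 立交桥 (lijiaoqiao)
- **Path**: `/home/long/project/立交桥`
- **Language**: Go 1.21/1.22
- **Architecture**: monolithic service (3 independent processes); the monolithic architecture is to be kept unchanged
- **Total size**: about 170,000 lines of Go code (excluding the competitor-products directory), 112 .go files
- **Dependencies**: Go modules, no vendor directory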
---
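## 2. Three core services
### 2.1 Gateway (entry gateway)
**Role**: OpenAI-compatible entry gateway responsible for access authentication, rate limiting, upstream routing, and basic auditing.
**Port**: 8080
**Tech stack**:
- Standard library `net/http` (no Gin)
- Middleware: Auth, CORS, RateLimit, RequestID injection
- Routing strategies: `latency` / `round_robin` / `weighted` / `availability` (wired into the main path); `cost_based` / `cost_aware` / `fallback` (experimental, not wired in)
- Provider registration: multi-provider configuration with dynamic routing
- Audit emitter: PostgreSQL or in-memory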
**Key files**:
- `internal/handler/handler.go`: OpenAI-compatible HTTP handler (Chat/Completion/Models)
- `internal/middleware/cors.go`: CORS configuration (production requires an explicit allowlist)
- `internal/app/bootstrap.go`: startup wiring, `BuildMux`
- `internal/router/router.go`: route selection and scoring
**Environment variables**:
- `GATEWAY_ENV` (dev/staging/production)
- `PASSWORD_ENCRYPTION_KEY` (required in production)
- `GATEWAY_CORS_ALLOW_ORIGINS` (required in production)
- `GATEWAY_TRUSTED_PROXIES`
- `GATEWAY_TOKEN_RUNTIME_MODE` (inmemory / remote_introspection)
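**Security issues already fixed**:
- Default key in production is detected and triggers a panic
- CORS in production mode enforces an explicit allowlist
- X-Request-ID is filtered through a character allowlist (prevents log injection)
- IP spoofing protection (TrustedProxies check)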
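The X-Request-ID allowlist guards against a client smuggling line breaks or key=value pairs into log lines. A minimal sketch of such a filter follows; the exact character set (`A-Z a-z 0-9 - _`), the 64-character cap, and the name `sanitizeRequestID` are assumptions, since the gateway's actual rules are not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// maxRequestIDLen is an assumed upper bound on the sanitized ID length.
const maxRequestIDLen = 64

// sanitizeRequestID keeps only ASCII letters, digits, '-' and '_',
// and truncates the result to maxRequestIDLen characters.
func sanitizeRequestID(raw string) string {
	var b strings.Builder
	for _, r := range raw {
		if b.Len() >= maxRequestIDLen {
			break
		}
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '-', r == '_':
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// CR/LF, '=' and spaces are dropped, so the forged log fields collapse into one token.
	fmt.Println(sanitizeRequestID("req-123\r\nlevel=error msg=forged")) // prints "req-123levelerrormsgforged"
}
```

When the sanitized value comes back empty, generating a fresh ID server-side is a common follow-up step.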
---
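### 2.2 Platform Token Runtime
**Role**: platform-level token lifecycle management, introspection, and audit queries.
**Port**: 18081
**Tech stack**:
- Standard library `net/http`
- Token lifecycle: issue / refresh / revoke / introspect
- Store implementations: in-memory (dev default) + PostgreSQL (optional for staging/prod)
- Audit events: in-memory / PostgreSQL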
**Key files**:
- `internal/httpapi/token_api.go`: HTTP API (issue/refresh/revoke/introspect/audit-events)
- `internal/auth/service/runtime_store.go`: in-memory runtime store (with a mutex for concurrency protection)
- `internal/auth/service/inmemory_runtime.go`: token lifecycle implementation
- `internal/auth/middleware/token_auth_middleware.go`: Bearer token validation
- `internal/auth/middleware/query_key_reject_middleware.go`: rejects externally supplied query keys
- `internal/auth/service/postgres_runtime_store.go`: PostgreSQL persistence
- `internal/auth/service/postgres_audit_store.go`: PostgreSQL audit store
**Environment variables**:
- `TOKEN_RUNTIME_ENV` (dev/staging/production)
- `TOKEN_RUNTIME_DATABASE_URL` (PostgreSQL DSN)
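**Security issues already fixed**:
- TTL changes after a refresh are persisted to the store
- Concurrent reads and writes in InMemoryRuntimeStore are protected by an RWMutex
- The audit-events endpoint enforces Bearer token authentication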
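The first two fixes combine naturally: a refresh must take the write lock and store the new expiry, while lookups only need the read lock. The sketch below illustrates that pattern under stated assumptions; `memoryStore`, `tokenRecord`, and the method names are hypothetical and do not mirror the real `InMemoryRuntimeStore` API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenRecord is a hypothetical stand-in for the runtime's token state.
type tokenRecord struct {
	Subject   string
	ExpiresAt time.Time
}

// memoryStore guards its map with an RWMutex: many readers or one writer.
type memoryStore struct {
	mu     sync.RWMutex
	tokens map[string]tokenRecord
}

func newMemoryStore() *memoryStore {
	return &memoryStore{tokens: make(map[string]tokenRecord)}
}

// Put inserts or overwrites a record under the write lock.
func (s *memoryStore) Put(id string, rec tokenRecord) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tokens[id] = rec
}

// Get reads a record under the shared read lock.
func (s *memoryStore) Get(id string) (tokenRecord, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	rec, ok := s.tokens[id]
	return rec, ok
}

// Refresh extends the expiry and writes the updated record back to the map.
func (s *memoryStore) Refresh(id string, ttl time.Duration, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.tokens[id]
	if !ok {
		return false
	}
	rec.ExpiresAt = now.Add(ttl)
	s.tokens[id] = rec
	return true
}

func main() {
	store := newMemoryStore()
	now := time.Now()
	store.Put("tok-1", tokenRecord{Subject: "alice", ExpiresAt: now.Add(time.Minute)})

	// Concurrent readers and refreshers; safe because every access holds the lock.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			store.Get("tok-1")
			store.Refresh("tok-1", time.Hour, now)
		}()
	}
	wg.Wait()

	rec, _ := store.Get("tok-1")
	fmt.Println(rec.ExpiresAt.Equal(now.Add(time.Hour))) // prints "true"
}
```

Because map values are copies in Go, forgetting the final `s.tokens[id] = rec` would silently discard the new TTL, which is the kind of bug the first bullet describes.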
---
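### 2.3 Supply API
**Role**: supply-chain business service responsible for accounts, packages, settlement, auditing, IAM, Outbox, and the compensation path.
**Port**: 18080
**Tech stack**:
- Standard library `net/http`
- Database: PostgreSQL (required; dev mode can partially degrade)
- SMS verification (disabled by default)
- Outbox pattern (transactional outbox)
- Compensation framework: account.create / package.publish / settlement.withdraw / quota.deduct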
**Key files**:
- `internal/httpapi/supply_api.go`: supply-side business endpoints (accounts/packages/billing/earnings/settlements)
- `internal/httpapi/alert_api.go`: alert endpoints
- `internal/repository/`: DB-backed repositories and integration tests
- `internal/audit/`: audit events, handlers, batch buffering
- `internal/outbox/`: Outbox processor
- `internal/compensation/`: compensation executor (currently fail-closed)
- `internal/iam/`: IAM implementation (disabled by default, enabled conditionally)
- `internal/domain/settlement.go`: settlement domain model (including the withdrawal gate)
- `internal/security/kms_service.go`: DEK derivation (HKDF-SHA256)
- `internal/app/bootstrap.go`: startup wiring
- `internal/app/runtime.go`: runtime construction and dependency injection
**Environment variables**:
- `SUPPLY_API_ENV` (dev/staging/production)
- `SMS_VERIFIER_IMPL` (optional)
- `SETTLEMENT_WITHDRAW_ENABLED` (default false)
- `SERVER_IAM_ENABLED` (default false)
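**Security issues already fixed**:
- KMS DEK derivation upgraded from simple byte rotation to HKDF-SHA256
- Production rejects HS256/HS384/HS512 and accepts only RSA
- Compensation executor changed from reporting false success to fail-closed
- IP spoofing protection (TrustedProxies injection)
- Brute-force protection (15-minute lockout after 5 failures)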
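As an illustration of the brute-force rule, the sketch below locks a key for 15 minutes once 5 failures have accumulated. This reading of "5 failures / 15 minutes" is an assumption, as are the names `loginGuard`, `RecordFailure`, and `Allowed`. The sketch is single-goroutine only; a real service would add a mutex and expire stale counters.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxFailures  = 5                // failures tolerated before locking
	lockDuration = 15 * time.Minute // how long the lock lasts
)

type attemptState struct {
	failures    int
	lockedUntil time.Time
}

// loginGuard tracks failures per key (for example per account or per IP).
// It is not safe for concurrent use.
type loginGuard struct {
	state map[string]*attemptState
}

func newLoginGuard() *loginGuard {
	return &loginGuard{state: make(map[string]*attemptState)}
}

// Allowed reports whether key may attempt a login at time now.
func (g *loginGuard) Allowed(key string, now time.Time) bool {
	st, ok := g.state[key]
	return !ok || !now.Before(st.lockedUntil)
}

// RecordFailure counts one failed attempt and locks the key at the threshold.
func (g *loginGuard) RecordFailure(key string, now time.Time) {
	st, ok := g.state[key]
	if !ok {
		st = &attemptState{}
		g.state[key] = st
	}
	st.failures++
	if st.failures >= maxFailures {
		st.lockedUntil = now.Add(lockDuration)
		st.failures = 0
	}
}

// RecordSuccess clears all state for key.
func (g *loginGuard) RecordSuccess(key string) {
	delete(g.state, key)
}

func main() {
	g := newLoginGuard()
	now := time.Now()
	for i := 0; i < maxFailures; i++ {
		g.RecordFailure("alice", now)
	}
	fmt.Println(g.Allowed("alice", now.Add(14*time.Minute))) // prints "false"
	fmt.Println(g.Allowed("alice", now.Add(15*time.Minute))) // prints "true"
}
```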
---
## 3. Database
### 3.1 DDL files
- `sql/postgresql/token_runtime_schema_v1.sql`: token runtime table schema
- `sql/postgresql/platform_core_schema_v1.sql`: platform-side audit events
- `sql/postgresql/supply_core_schema_v2.sql`: supply-api core tables (accounts/packages/settlements/earnings)
- `sql/postgresql/partition_strategy_v1.sql`: partition strategy
- `sql/postgresql/outbox_pattern_v1.sql`: Outbox table
- `sql/postgresql/token_status_registry_v1.sql`: token status registry
- `sql/postgresql/audit_alerts_v1.sql`: alert audit table
### 3.2 Key database schema observations
- The `auth_token_runtime` table stores token lifecycle data
- The `auth_token_audit_events` table stores token audit events (token runtime side)
- The `supply_*` family of tables stores supply-side business data
- The `outbox` table implements the transactional outbox pattern
- Partition strategy: partitioned by time or by hash
---
## 4. Test status
### 4.1 Test types
- **Unit tests**: internal logic tests in each module (`*_test.go`)
- **Integration tests**: the `supply-api` repository layer is covered by `bash scripts/run_integration_tests.sh`
- **E2E tests**: `supply-api/e2e/production_flow_test.go` (uses mock stand-ins rather than real external dependencies)
### 4.2 Key functionality covered by tests
- Gateway handler tests (Chat/Completion/Models/error response)
- Token lifecycle tests (issue/refresh/revoke/introspect)
- Token audit events tests (query / authorization enforcement)
- Supply API core business flow tests
### 4.3 Test environment limitations (5 environment issues, not code defects)
1. `TestTokenStoreIntegration`: module not found because GOPATH is not configured
2. `TestAuditLogExporter`: requires etcd on port 2379
3. `TestIntegrationPipeline`: requires a Kafka broker on port 9092
4. `TestCloudWatchLogsExporter`: requires AWS credentials to be configured
5. Python type hint warning: system Python < 3.10
---
## 5. CI/CD and verification
- `scripts/ci/repo_integrity_check.sh`: repository-wide unified verification (upgraded to `go test -count=1`)
- Per-module verification: `go test -count=1 ./...`
- Additionally for Supply API: `bash scripts/run_integration_tests.sh ./internal/repository`
- Migration script: `scripts/migrate.sh`
---
## 6. Documentation
- `TEST_ENVIRONMENT_ISSUES.md`: record of the 5 environment issues
- `review/SYSTEMATIC_REPAIR_PLAN_2026-04-17.md`: repair plan for 21 issues
- `review/REPORT_CORRECTION_2026-04-17.md`: corrected version of the report
- `docs/plans/2026-04-17-remediation-execution-checklist.md`: execution checklist (all 13 tasks completed)
- `docs/product/completed_feature_inventory_v1_2026-04-17.md`: completed feature inventory v1
- Each service's README records the "current real state"
---
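## 7. Project constraints
- **Architecture constraint**: keep the monolithic service architecture; do not split into microservices
- **Optimization goals**: code quality, performance, reliability, operational simplicity
- **Known technical debt**:
  - The project is nominally a monolith, yet the codebase is sizeable and the three services are independent, each with its own DB
  - Some capabilities are "written but not wired into the main path" (such as the advanced routing strategies)
  - E2E tests rely on mocks and are not equivalent to a full end-to-end run in a real environment
  - Some Go module paths take the form `lijiaoqiao/<module>`, which places certain requirements on GOPATH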


@@ -4,7 +4,7 @@
**Path:** `/home/long/project/立交桥/`
**Prepared on:** 2026-04-17
**Based on:** SYSTEMATIC_REVIEW_REPORT + 4 specialized reports (2026-04-16)
**Status:** 🟡 Partially complete: 3 items fixed today, remaining P0/P1 pending
**Status:** ✅ All P0/P1 fixed and verified
---
@@ -29,26 +29,17 @@
---
### P0-1: gateway hard-coded encryption key fallback 🔴
- **File:** `gateway/internal/config/config.go:18`
- **Issue:**
```go
encryptionKey = []byte(getEnv("PASSWORD_ENCRYPTION_KEY",
"default-key-32-bytes-long!!!!!!!"))
```
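If `PASSWORD_ENCRYPTION_KEY` is not set in production, all passwords are encrypted with the insecure default value
- **Fix:** outside dev/test environments the key must be set explicitly, otherwise `log.Fatal`
- **Effort:** 0.5h
- **Status:** ⬜ Pending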
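### P0-1: gateway hard-coded encryption key fallback 🔴 ✅ Fixed
- **File:** `gateway/internal/config/config.go:15,264` + `gateway/internal/app/bootstrap.go:262-291`
- **Fix approach:** The backward-compatible default in `config.go` is left unchanged (dev mode still needs the fallback). Instead, `validateStartupSecurity()` was added to `bootstrap.go`; it runs when `BuildServer` starts and, in production/staging environments, calls `log.Fatal` to block startup if `PASSWORD_ENCRYPTION_KEY` is unset or still equal to the default value
- **Verification:** `grep -n "validateStartupSecurity" gateway/internal/app/bootstrap.go` shows the call on line 27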
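The shape of such a startup guard can be sketched as follows. This is not the contents of `bootstrap.go`: the function `checkEncryptionKey` is hypothetical, it returns an error instead of calling `log.Fatal` so that it can be exercised in a test, and the environment names are taken from the `GATEWAY_ENV` values listed in the project overview.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultEncryptionKey is the fallback value quoted in the original finding.
const defaultEncryptionKey = "default-key-32-bytes-long!!!!!!!"

// checkEncryptionKey returns an error when a production or staging
// environment runs with a missing or default encryption key.
// Other environments (for example dev) may keep the fallback.
func checkEncryptionKey(env, key string) error {
	switch env {
	case "production", "staging":
		if key == "" || key == defaultEncryptionKey {
			return errors.New("PASSWORD_ENCRYPTION_KEY must be set to a non-default value")
		}
	}
	return nil
}

func main() {
	fmt.Println(checkEncryptionKey("dev", ""))                          // prints "<nil>"
	fmt.Println(checkEncryptionKey("production", defaultEncryptionKey)) // prints the error message
}
```

A caller at startup would pass the result to `log.Fatal` when it is non-nil, which keeps the policy testable and the fatal exit in one place.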
---
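### P0-2: gateway CORS allows any origin 🔴
- **File:** `gateway/internal/middleware/cors.go:23`
- **Issue:** `AllowOrigins: []string{"*"}` allows cross-origin requests from every origin
- **Fix:** change the default to empty or restrictive, driven by configuration
- **Effort:** 0.5h
- **Status:** ⬜ Pending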
### P0-2: gateway CORS allows any origin 🔴 ✅ Fixed
- **File:** `gateway/internal/middleware/cors.go:23` + `gateway/internal/app/bootstrap.go:247-260,293-298`
- **Fix approach:** `DefaultCORSConfig()` still keeps `*` (backward compatibility for dev), and `buildCORSConfig()` in `bootstrap.go` falls back to `*` when `CORS_ALLOW_ORIGINS` is not configured; `usesWildcardCORS()`, called from `validateStartupSecurity()`, then calls `log.Fatal` to block startup when it detects `*` in production
- **Verification:** `grep -n "usesWildcardCORS" gateway/internal/app/bootstrap.go` shows the call on line 269
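The two-step behavior (fall back to `*` when unconfigured, then refuse the wildcard in production) can be modeled in a few lines. The helpers `parseOrigins` and `hasWildcard` below are illustrative assumptions and are not the bodies of `buildCORSConfig()` or `usesWildcardCORS()`; the comma-separated format of the variable is also assumed.

```go
package main

import (
	"fmt"
	"strings"
)

// parseOrigins splits a comma-separated origin list.
// An empty value falls back to the wildcard, matching the dev default described above.
func parseOrigins(raw string) []string {
	if strings.TrimSpace(raw) == "" {
		return []string{"*"}
	}
	var out []string
	for _, o := range strings.Split(raw, ",") {
		if o = strings.TrimSpace(o); o != "" {
			out = append(out, o)
		}
	}
	return out
}

// hasWildcard reports whether any configured origin is the catch-all "*".
func hasWildcard(origins []string) bool {
	for _, o := range origins {
		if o == "*" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasWildcard(parseOrigins("")))                                                  // prints "true"
	fmt.Println(hasWildcard(parseOrigins("https://app.example.com, https://admin.example.com"))) // prints "false"
}
```

Checking the parsed list rather than the raw string also catches a wildcard that was mixed into an otherwise explicit allowlist.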
---