niuniu/lijiaoqiao

Fork 0

Files

Your Name a699a1ac98 chore: initial snapshot for gitea/github upload

2026-03-26 16:04:46 +08:00

36 KiB

Raw Blame History

详细技术架构设计

版本：v1.0 日期：2026-03-18 依据：backend skill 最佳实践状态：历史草稿（已被 technical_architecture_optimized_v2_2026-03-18.md 替代，不作为实施基线）

1. 系统架构概览

1.1 整体架构图

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                    客户端层                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐           │
│  │ Web App    │  │ Mobile App  │  │ SDK (Python)│  │ SDK (Node)  │           │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘           │
└─────────────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                 API Gateway (入口层)                               │
│  ┌─────────────────────────────────────────────────────────────────────────────┐ │
│  │ • 限流 • 鉴权 • 路由 • 日志 • 监控                                         │ │
│  └─────────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                 业务服务层                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Router      │  │  Auth       │  │  Billing     │  │  Provider    │       │
│  │  Service     │  │  Service    │  │  Service    │  │  Service    │       │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Tenant     │  │  Risk       │  │  Settlement │  │  Webhook    │       │
│  │  Service    │  │  Service    │  │  Service    │  │  Service    │       │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                 基础设施层                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  PostgreSQL  │  │    Redis     │  │    Kafka     │  │    S3/MinIO │       │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                 外部集成层                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   subapi     │  │  OpenAI API │  │ Anthropic   │  │  国内供应商   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────────────────────────────┘

2. 技术选型

2.1 技术栈

层级	技术	版本	说明
API Gateway	Kong / Traefik	3.x	高性能网关
后端服务	Go	1.21	高并发
Web框架	Gin	1.9	高性能
数据库	PostgreSQL	15	主数据库
缓存	Redis	7.x	缓存+限流
消息队列	Kafka	3.x	异步处理
服务网格	Istio	1.18	微服务治理
容器编排	Kubernetes	1.28	容器编排
CI/CD	GitHub Actions	-	持续集成
监控	Prometheus + Grafana	-	可观测性
日志	ELK Stack	8.x	日志收集

2.2 项目结构

llm-gateway/
├── cmd/                      # 入口程序
│   ├── gateway/              # 网关服务
│   ├── router/               # 路由服务
│   ├── billing/              # 计费服务
│   └── admin/                # 管理后台
├── internal/                 # 内部包
│   ├── config/               # 配置管理
│   ├── middleware/           # 中间件
│   ├── handler/              # HTTP处理器
│   ├── service/              # 业务逻辑
│   ├── repository/           # 数据访问
│   └── model/                # 数据模型
├── pkg/                      # 公共包
│   ├── utils/                # 工具函数
│   ├── errors/               # 错误定义
│   └── constants/            # 常量定义
├── api/                      # API定义
│   ├── openapi/              # OpenAPI规范
│   └── proto/                # Protobuf定义
├── configs/                  # 配置文件
├── scripts/                  # 脚本
├── test/                     # 测试
└── docs/                     # 文档

3. 模块详细设计

3.1 API Gateway 模块

// cmd/gateway/main.go
package main

import (
    "github.com/gin-gonic/gin"
    "llm-gateway/internal/middleware"
    "llm-gateway/internal/handler"
)

func main() {
    r := gin.Default()

    // 全局中间件
    r.Use(middleware.Logger())
    r.Use(middleware.Recovery())
    r.Use(middleware.CORS())

    // 限流
    r.Use(middleware.RateLimiter())

    // API路由
    v1 := r.Group("/v1")
    {
        // 认证
        v1.POST("/auth/token", handler.AuthToken)
        v1.POST("/auth/refresh", handler.RefreshToken)

        // 对话
        v1.POST("/chat/completions", middleware.AuthRequired(), handler.ChatCompletions)
        v1.POST("/completions", middleware.AuthRequired(), handler.Completions)

        // Embeddings
        v1.POST("/embeddings", middleware.AuthRequired(), handler.Embeddings)

        // 模型
        v1.GET("/models", handler.ListModels)

        // 用户
        users := v1.Group("/users")
        users.Use(middleware.AuthRequired())
        {
            users.GET("", handler.ListUsers)
            users.GET("/:id", handler.GetUser)
            users.PUT("/:id", handler.UpdateUser)
        }

        // API Key
        keys := v1.Group("/keys")
        keys.Use(middleware.AuthRequired())
        {
            keys.GET("", handler.ListKeys)
            keys.POST("", handler.CreateKey)
            keys.DELETE("/:id", handler.DeleteKey)
            keys.POST("/:id/rotate", handler.RotateKey)
        }

        // 计费
        billing := v1.Group("/billing")
        billing.Use(middleware.AuthRequired())
        {
            billing.GET("/balance", handler.GetBalance)
            billing.GET("/usage", handler.GetUsage)
            billing.GET("/invoices", handler.ListInvoices)
        }

        // 供应方
        supply := v1.Group("/supply")
        supply.Use(middleware.AuthRequired())
        {
            supply.GET("/accounts", handler.ListAccounts)
            supply.POST("/accounts", handler.CreateAccount)
            supply.GET("/packages", handler.ListPackages)
            supply.POST("/packages", handler.CreatePackage)
            supply.GET("/earnings", handler.GetEarnings)
        }
    }

    // 管理后台
    admin := r.Group("/admin")
    admin.Use(middleware.AdminRequired())
    {
        admin.GET("/stats", handler.AdminStats)
        admin.GET("/users", handler.AdminListUsers)
        admin.POST("/users/:id/disable", handler.DisableUser)
    }

    r.Run(":8080")
}

3.2 路由服务模块

// internal/service/router.go
package service

import (
    "context"
    "time"
    "llm-gateway/internal/model"
    "llm-gateway/internal/adapter"
)

type RouterService struct {
    adapterRegistry *adapter.Registry
    metricsCollector *MetricsCollector
}

type RouteRequest struct {
    Model      string                 `json:"model"`
    Messages   []model.Message      `json:"messages"`
    Options    model.CompletionOptions `json:"options"`
    UserID     int64                `json:"user_id"`
    TenantID   int64                `json:"tenant_id"`
}

func (s *RouterService) Route(ctx context.Context, req RouteRequest) (*model.CompletionResponse, error) {
    // 1. 获取可用供应商
    providers := s.adapterRegistry.GetAvailableProviders(req.Model)
    if len(providers) == 0 {
        return nil, ErrNoProviderAvailable
    }

    // 2. 选择最优供应商
    selected := s.selectProvider(providers, req)

    // 3. 记录路由决策
    s.metricsCollector.RecordRoute(ctx, &RouteMetrics{
        Model: req.Model,
        Provider: selected.Name(),
        TenantID: req.TenantID,
    })

    // 4. 调用供应商
    resp, err := selected.Call(ctx, req)
    if err != nil {
        // 5. 失败时尝试fallback
        return s.tryFallback(ctx, req, err)
    }

    return resp, nil
}

func (s *RouterService) selectProvider(providers []*adapter.Provider, req RouteRequest) *adapter.Provider {
    // 多维度选择策略
    var best *adapter.Provider
    bestScore := -1.0

    for _, p := range providers {
        score := s.calculateScore(p, req)
        if score > bestScore {
            bestScore = score
            best = p
        }
    }

    return best
}

func (s *RouterService) calculateScore(p *adapter.Provider, req RouteRequest) float64 {
    // 延迟评分 (40%)
    latencyScore := 1.0 / (p.LatencyP99 + 1)

    // 可用性评分 (30%)
    availabilityScore := p.Availability

    // 成本评分 (20%)
    costScore := 1.0 / (p.CostPer1K + 1)

    // 质量评分 (10%)
    qualityScore := p.QualityScore

    return latencyScore*0.4 + availabilityScore*0.3 + costScore*0.2 + qualityScore*0.1
}

3.3 Provider Adapter 模块

// internal/adapter/registry.go
package adapter

import (
    "context"
    "sync"
)

type Registry struct {
    mu        sync.RWMutex
    providers map[string]Provider
    fallback  map[string]string
    health    map[string]*HealthStatus
}

type Provider interface {
    Name() string
    Call(ctx context.Context, req interface{}) (interface{}, error)
    HealthCheck(ctx context.Context) error
    GetCapabilities() Capabilities
}

type Capabilities struct {
    SupportsStreaming bool
    SupportsFunctionCall bool
    SupportsVision bool
    MaxTokens int
    Models []string
}

type HealthStatus struct {
    IsHealthy  bool
    Latency   time.Duration
    LastCheck time.Time
}

func NewRegistry() *Registry {
    return &Registry{
        providers: make(map[string]Provider),
        fallback:  make(map[string]string),
        health:   make(map[string]*HealthStatus),
    }
}

func (r *Registry) Register(name string, p Provider, fallback string) {
    r.mu.Lock()
    defer r.mu.Unlock()

    r.providers[name] = p
    if fallback != "" {
        r.fallback[name] = fallback
    }

    // 启动健康检查
    go r.healthCheckLoop(name, p)
}

func (r *Registry) Get(name string) (Provider, error) {
    r.mu.RLock()
    defer r.mu.RUnlock()

    p, ok := r.providers[name]
    if !ok {
        return nil, ErrProviderNotFound
    }

    // 检查健康状态
    if health, ok := r.health[name]; ok && !health.IsHealthy {
        // 尝试fallback
        if fallback, ok := r.fallback[name]; ok {
            return r.providers[fallback], nil
        }
    }

    return p, nil
}

func (r *Registry) healthCheckLoop(name string, p Provider) {
    ticker := time.NewTicker(30 * time.Second)
    for range ticker.C {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        err := p.HealthCheck(ctx)
        cancel()

        r.mu.Lock()
        r.health[name] = &HealthStatus{
            IsHealthy: err == nil,
            LastCheck: time.Now(),
        }
        r.mu.Unlock()
    }
}

3.4 计费服务模块

// internal/service/billing.go
package service

import (
    "context"
    "decimal"
    "llm-gateway/internal/model"
)

type BillingService struct {
    repo        *repository.BillingRepository
    balanceMgr  *BalanceManager
    notifier    *WebhookNotifier
}

type Money struct {
    Amount   decimal.Decimal
    Currency string
}

func (s *BillingService) ProcessRequest(ctx context.Context, req *model.LLMRequest) (*model.BillingRecord, error) {
    // 1. 预扣额度
    estimatedCost := s.EstimateCost(req)
    reserved, err := s.balanceMgr.Reserve(ctx, req.UserID, estimatedCost)
    if err != nil {
        return nil, ErrInsufficientBalance
    }

    // 2. 处理请求（实际扣费）
    actualCost := s.CalculateActualCost(req.Response)

    // 3. 补偿差额
    diff := actualCost.Sub(reserved.Amount)
    if diff.IsPositive() {
        err = s.balanceMgr.Charge(ctx, req.UserID, diff)
    } else if diff.IsNegative() {
        err = s.balanceMgr.Refund(ctx, req.UserID, diff.Abs())
    }

    // 4. 记录账单
    record := &model.BillingRecord{
        UserID:         req.UserID,
        RequestID:       req.ID,
        Model:          req.Model,
        PromptTokens:   req.Response.Usage.PromptTokens,
        CompletionTokens: req.Response.Usage.CompletionTokens,
        Amount:         actualCost,
        Status:         model.BillingStatusSettled,
    }

    err = s.repo.Create(ctx, record)
    if err != nil {
        // 记录失败，触发补偿
        s.notifier.NotifyBillingAnomaly(ctx, record, err)
    }

    return record, nil
}

func (s *BillingService) EstimateCost(req *model.LLMRequest) Money {
    // 使用模型定价估算
    price := s.repo.GetModelPrice(req.Model)

    promptCost := decimal.NewFromInt(int64(req.Messages.Tokens()))
        .Mul(price.InputPer1K)
        .Div(decimal.NewFromInt(1000))

    // 估算输出
    estimatedOutput := decimal.NewFromInt(int64(req.Options.MaxTokens))
    outputCost := estimatedOutput
        .Mul(price.OutputPer1K)
        .Div(decimal.NewFromInt(1000))

    total := promptCost.Add(outputCost)

    return Money{Amount: total.Round(2), Currency: "USD"}
}

3.5 风控服务模块

// internal/service/risk.go
package service

import (
    "llm-gateway/internal/model"
)

type RiskService struct {
    rules       []RiskRule
    rateLimiter *RateLimiter
}

type RiskRule struct {
    Name      string
    Condition func(*model.LLMRequest, *model.User) bool
    Score     int
    Action    RiskAction
}

type RiskAction string

const (
    RiskActionAllow  RiskAction = "allow"
    RiskActionBlock  RiskAction = "block"
    RiskActionReview RiskAction = "review"
)

func (s *RiskService) Evaluate(ctx context.Context, req *model.LLMRequest) *RiskResult {
    var totalScore int
    var triggers []string

    user := s.getUser(ctx, req.UserID)

    for _, rule := range s.rules {
        if rule.Condition(req, user) {
            totalScore += rule.Score
            triggers = append(triggers, rule.Name)
        }
    }

    // 决策
    if totalScore >= 70 {
        return &RiskResult{
            Action:  RiskActionBlock,
            Score:   totalScore,
            Triggers: triggers,
        }
    } else if totalScore >= 40 {
        return &RiskResult{
            Action:  RiskActionReview,
            Score:   totalScore,
            Triggers: triggers,
        }
    }

    return &RiskResult{
        Action:  RiskActionAllow,
        Score:   totalScore,
    }
}

// 预定义风控规则
func DefaultRiskRules() []RiskRule {
    return []RiskRule{
        {
            Name: "high_velocity",
            Condition: func(req *model.LLMRequest, user *model.User) bool {
                return req.TokensPerMinute > 1000
            },
            Score: 30,
            Action: RiskActionBlock,
        },
        {
            Name: "new_account_high_value",
            Condition: func(req *model.LLMRequest, user *model.User) bool {
                return user.AccountAgeDays < 7 && req.EstimatedCost > 100
            },
            Score: 35,
            Action: RiskActionReview,
        },
        {
            Name: "unusual_model",
            Condition: func(req *model.LLMRequest, user *model.User) bool {
                return !user.PreferredModels.Contains(req.Model)
            },
            Score: 15,
            Action: RiskActionReview,
        },
    }
}

4. 数据流设计

4.1 请求处理流程

用户请求
    │
    ▼
┌─────────────────┐
│  API Gateway    │
│  • 限流         │
│  • 鉴权         │
│  • 日志         │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  路由决策      │
│  • 模型映射    │
│  • 供应商选择 │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐ ┌────────┐
│ Provider│ │ Fallback│
│  A     │ │  B     │
└────┬───┘ └────┬───┘
     │         │
     └────┬────┘
          │
          ▼
┌─────────────────┐
│  计费处理       │
│  • 预扣         │
│  • 实际扣费    │
│  • 记录        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  响应返回      │
└─────────────────┘

4.2 异步处理流程

请求处理
    │
    ▼
┌─────────────────┐
│  同步：预扣+执行 │
│                 │
│  • 预扣额度     │
│  • 调用供应商   │
│  • 实际扣费    │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐ ┌────────┐
│ 同步响应 │ │ 异步队列│
│         │ │        │
│ • 返回   │ │ • 记录使用量│
│ • 更新   │ │ • 统计    │
│   余额   │ │ • 对账    │
└────────┘ └────┬────┘
                 │
                 ▼
          ┌─────────────┐
          │ Kafka Topic │
          │ • usage     │
          │ • billing   │
          │ • audit     │
          └──────┬──────┘
                 │
          ┌─────┴─────┐
          │           │
          ▼           ▼
    ┌─────────┐ ┌─────────┐
    │ 消费者   │ │ 消费者   │
    │ • 写入DB │ │ • 对账   │
    │ • 监控   │ │ • 告警   │
    └─────────┘ └─────────┘

5. API 设计规范

5.1 RESTful API 设计

# openapi.yaml
openapi: 3.0.3
info:
  title: LLM Gateway API
  version: 1.0.0
  description: Enterprise LLM Gateway API

servers:
  - url: https://api.lgateway.com/v1
    description: Production server
  - url: https://staging-api.lgateway.com/v1
    description: Staging server

paths:
  /chat/completions:
    post:
      summary: Create a chat completion
      operationId: createChatCompletion
      tags:
        - Chat
      security:
        - BearerAuth: []
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ChatCompletionRequest'
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ChatCompletionResponse'
        '400':
          $ref: '#/components/responses/BadRequest'
        '401':
          $ref: '#/components/responses/Unauthorized'
        '429':
          $ref: '#/components/responses/RateLimited'
        '500':
          $ref: '#/components/responses/InternalServerError'

components:
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT

  schemas:
    ChatCompletionRequest:
      type: object
      required:
        - model
        - messages
      properties:
        model:
          type: string
          description: Model identifier
        messages:
          type: array
          items:
            $ref: '#/components/schemas/Message'
        temperature:
          type: number
          minimum: 0
          maximum: 2
          default: 1.0
        max_tokens:
          type: integer
          minimum: 1
          maximum: 32000
        stream:
          type: boolean
          default: false

    Message:
      type: object
      required:
        - role
        - content
      properties:
        role:
          type: string
          enum: [system, user, assistant]
        content:
          type: string

  responses:
    BadRequest:
      description: Bad request
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
    Unauthorized:
      description: Unauthorized
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
    RateLimited:
      description: Rate limited
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
    InternalServerError:
      description: Internal server error
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'

5.2 错误响应格式

{
  "error": {
    "code": "BILLING_001",
    "message": "Insufficient balance",
    "message_i18n": {
      "zh_CN": "余额不足",
      "en_US": "Insufficient balance"
    },
    "details": {
      "required": 100.00,
      "available": 50.00,
      "top_up_url": "/v1/billing/top-up"
    },
    "trace_id": "req_abc123",
    "retryable": false,
    "doc_url": "https://docs.lgateway.com/errors/billing-001"
  }
}

6. 数据库设计

6.1 核心表结构

-- 用户表
CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    name VARCHAR(100),
    tenant_id BIGINT REFERENCES tenants(id),
    role VARCHAR(20) DEFAULT 'user',
    status VARCHAR(20) DEFAULT 'active',
    mfa_enabled BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- API Keys表
CREATE TABLE api_keys (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    key_hash VARCHAR(64) NOT NULL UNIQUE,
    key_prefix VARCHAR(20) NOT NULL,
    description VARCHAR(200),
    permissions JSONB DEFAULT '{}',
    rate_limit_rpm INT DEFAULT 60,
    rate_limit_tpm INT DEFAULT 100000,
    status VARCHAR(20) DEFAULT 'active',
    expires_at TIMESTAMP,
    last_used_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- 租户表
CREATE TABLE tenants (
    id BIGSERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    plan VARCHAR(20) DEFAULT 'free',
    status VARCHAR(20) DEFAULT 'active',
    settings JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- 账单记录表
CREATE TABLE billing_records (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    tenant_id BIGINT REFERENCES tenants(id),
    request_id VARCHAR(64) NOT NULL,
    provider VARCHAR(50) NOT NULL,
    model VARCHAR(50) NOT NULL,
    prompt_tokens INT NOT NULL,
    completion_tokens INT NOT NULL,
    amount DECIMAL(10, 4) NOT NULL,
    currency VARCHAR(3) DEFAULT 'USD',
    status VARCHAR(20) DEFAULT 'settled',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- 使用量记录表
CREATE TABLE usage_records (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    tenant_id BIGINT REFERENCES tenants(id),
    api_key_id BIGINT REFERENCES api_keys(id),
    request_id VARCHAR(64) NOT NULL,
    model VARCHAR(50) NOT NULL,
    provider VARCHAR(50) NOT NULL,
    prompt_tokens INT DEFAULT 0,
    completion_tokens INT DEFAULT 0,
    latency_ms INT,
    status_code INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- 索引
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_tenant ON users(tenant_id);
CREATE INDEX idx_api_keys_user ON api_keys(user_id);
CREATE INDEX idx_api_keys_hash ON api_keys(key_hash);
CREATE INDEX idx_billing_user ON billing_records(user_id, created_at);
CREATE INDEX idx_billing_tenant ON billing_records(tenant_id, created_at);
CREATE INDEX idx_usage_user ON usage_records(user_id, created_at);
CREATE INDEX idx_usage_request ON usage_records(request_id);

7. 消息队列运维简化

7.1 Kafka运维挑战分析

挑战	影响	简化方案
集群管理复杂	运维成本高	使用托管服务
分区副本同步	数据延迟	优化配置
消费者组管理	消费积压	简化架构
监控告警	噪声过多	精简指标
容量规划	扩展困难	自动化伸缩

7.2 托管Kafka服务选型

# 消息队列服务选型
recommended:
  # 阿里云Kafka（国内）
  aliyun:
    type: managed
    version: 2.2.0
    features:
      - 自动分区重平衡
      - 死信队列支持
      - 跨可用区容灾
    ops_benefits:
      - 免运维
      - SLA 99.9%
      - 按量计费

  # AWS MSK（海外）
  aws_msk:
    type: managed
    version: 2.8.0
    features:
      - MSK Serverless免容量规划
      - 精细访问控制
    ops_benefits:
      - 与AWS生态集成
      - 托管升级

alternatives:
  # 轻量级替代方案
  redis_streams:
    use_case: "低延迟小消息"
    limitations:
      - 无持久化保证
      - 单线程消费

  # 业务简单时的选择
  database_queues:
    use_case: "消息量<1000/s"
    limitations:
      - 性能有限
      - 需自行实现重试

7.3 Topic设计简化

# 简化的Topic设计 - 从原来的10+个精简为4个
TOPIC_DESIGN = {
    # 核心业务Topic
    "llm.requests": {
        "partitions": 6,
        "retention": "7d",
        "description": "LLM请求流转"
    },

    # 异步计费Topic
    "llm.billing": {
        "partitions": 3,
        "retention": "30d",
        "description": "计费流水"
    },

    # 通知事件Topic
    "llm.events": {
        "partitions": 3,
        "retention": "3d",
        "description": "各类事件通知"
    },

    # 监控数据Topic
    "llm.metrics": {
        "partitions": 1,
        "retention": "1d",
        "description": "原始监控数据"
    }
}

7.4 消费者组简化

# 简化的消费者组设计
class SimplifiedConsumerGroup:
    """简化消费者组管理"""

    # 原来：每个服务多个消费者组
    # 优化后：一个服务一个消费者组

    def __init__(self):
        self.groups = {
            "router-service": {
                "topics": ["llm.requests"],
                "consumers": 3,  # 与分区数匹配
                "strategy": "round_robin"
            },
            "billing-service": {
                "topics": ["llm.billing"],
                "consumers": 2,
                "strategy": "failover"
            },
            "notification-service": {
                "topics": ["llm.events"],
                "consumers": 1,
                "strategy": "broadcast"
            }
        }

    def get_consumer_count(self, group: str) -> int:
        """自动计算消费者数量"""
        return self.groups[group]["consumers"]

7.5 自动化运维脚本

#!/bin/bash
# scripts/kafka-ops.sh - Kafka运维自动化

set -e

# 1. 主题健康检查
check_topics() {
    echo "=== 检查Topic状态 ==="
    kafka-topics.sh --bootstrap-server $KAFKA_BROKER --list | while read topic; do
        partitions=$(kafka-topics.sh --bootstrap-server $KAFKA_BROKER \
            --topic $topic --describe | grep -c "Leader:")
        lag=$(kafka-consumer-groups.sh --bootstrap-server $KAFKA_BROKER \
            --group $(get_group_for_topic $topic) \
            --describe | awk '{sum+=$6} END {print sum}')

        echo "Topic: $topic, Partitions: $partitions, Lag: $lag"

        if [ $lag -gt 1000 ]; then
            alert "消费积压告警: $topic 积压 $lag 条"
        fi
    done
}

# 2. 自动创建Topic（幂等）
ensure_topics() {
    for topic in "${!TOPIC_DESIGN[@]}"; do
        config="${TOPIC_DESIGN[$topic]}"
        kafka-topics.sh --bootstrap-server $KAFKA_BROKER \
            --topic $topic --create \
            --partitions ${config[partitions]} \
            --replication-factor 3 \
            --config retention.ms=${config[retention]} \
            2>/dev/null || echo "Topic $topic already exists"
    done
}

# 3. 消费延迟监控
monitor_lag() {
    for group in $(kafka-consumer-groups.sh --bootstrap-server $KAFKA_BROKER \
        --list 2>/dev/null); do
        lag=$(kafka-consumer-groups.sh --bootstrap-server $KAFKA_BROKER \
            --group $group --describe | awk '{sum+=$6} END {print sum}')

        prometheus_pushgateway "kafka_consumer_lag" $lag "group=$group"
    done
}

7.6 监控指标精简

# 精简的Kafka监控指标 - 避免噪声
kafka_metrics:
  essential:
    - name: kafka_consumer_group_lag_max
      description: 最大消费延迟
      alert_threshold: 1000

    - name: kafka_topic_partition_under_replicated
      description: 副本不同步数
      alert_threshold: 0

    - name: kafka_server_broker_topic_messages_in_total
      description: 消息入站速率
      alert_threshold: rate_change > 50%

  optional:
    # 以下指标仅在排查问题时启用
    - kafka_network_request_metrics
    - kafka_consumer_fetch_manager_metrics
    - kafka_producer_metrics

7.7 容量规划自动化

# 自动容量规划
class KafkaCapacityPlanner:
    """Kafka容量自动规划"""

    def calculate_requirements(self, metrics: dict) -> dict:
        """基于实际流量计算容量"""
        # 峰值QPS
        peak_qps = metrics["peak_qps"]

        # 平均消息大小
        avg_msg_size = metrics["avg_msg_size_kb"] * 1024

        # 保留期
        retention_days = 7

        # 计算所需磁盘
        disk_per_day = peak_qps * avg_msg_size * 86400
        total_disk = disk_per_day * retention_days

        # 推荐配置
        return {
            "partitions": min(peak_qps // 100, 12),  # 最大12分区
            "replication_factor": 3,
            "disk_gb": total_disk / (1024**3),
            "broker_count": 3,
            "scaling_trigger": "disk_usage > 70%"
        }

7.8 故障自愈机制

# Kafka故障自愈
class KafkaSelfHealing:
    """Kafka自愈机制"""

    def __init__(self):
        self.healing_rules = {
            "under_replicated": {
                "detect": "partition.replicas - in.sync.replicas > 0",
                "action": "trigger_preferred_reelection",
                "cooldown": 300  # 5分钟
            },
            "controller_failover": {
                "detect": "controller_epoch跳跃",
                "action": "等待自动选举",
                "cooldown": 60
            },
            "partition_offline": {
                "detect": "leader == -1",
                "action": "assign_new_leader",
                "cooldown": 60
            }
        }

    async def check_and_heal(self):
        """定期检查并自愈"""
        for rule_name, rule in self.healing_rules.items():
            if self.should_heal(rule_name):
                await self.execute_healing(rule_name, rule)

7. 一致性验证

7.1 与现有文档一致性

设计项	对应文档	一致性
Provider Adapter	`architecture_solution_v1.md`	✅
路由策略	`architecture_solution_v1.md`	✅
计费精度	`business_solution_v1.md`	✅
安全机制	`security_solution_v1.md`	✅
API版本管理	`api_solution_v1.md`	✅
错误码体系	`api_solution_v1.md`	✅
限流机制	`p1_optimization_solution_v1.md`	✅
Webhook	`p1_optimization_solution_v1.md`	✅

8. 实施计划

8.1 开发阶段

阶段	内容	周数
Phase 1	基础设施 + API Gateway	3周
Phase 2	核心服务开发	4周
Phase 3	集成测试	2周
Phase 4	性能优化	2周

文档状态：详细技术架构设计 关联文档：

architecture_solution_v1_2026-03-18.md
api_solution_v1_2026-03-18.md
security_solution_v1_2026-03-18.md

36 KiB Raw Blame History Unescape Escape

详细技术架构设计

1. 系统架构概览

1.1 整体架构图

2. 技术选型

2.1 技术栈

2.2 项目结构

3. 模块详细设计

3.1 API Gateway 模块

3.2 路由服务模块

3.3 Provider Adapter 模块

3.4 计费服务模块

3.5 风控服务模块

4. 数据流设计

4.1 请求处理流程

4.2 异步处理流程

5. API 设计规范

5.1 RESTful API 设计

5.2 错误响应格式

6. 数据库设计

6.1 核心表结构

7. 消息队列运维简化

7.1 Kafka运维挑战分析

7.2 托管Kafka服务选型

7.3 Topic设计简化

7.4 消费者组简化

7.5 自动化运维脚本

7.6 监控指标精简

7.7 容量规划自动化

7.8 故障自愈机制

7. 一致性验证

7.1 与现有文档一致性

8. 实施计划

8.1 开发阶段

36 KiB

Raw Blame History