sub2api-cn-relay-manager/.agent/skills/false-negative-status-triage/SKILL.md
2026-05-30 14:55:16 +08:00

name: false-negative-status-triage
description: Diagnose and fix false-negative status signals when control-plane status says something is degraded or broken but real user traffic works. Use this whenever provider status, account status, probe status, route health, or inventory state disagrees with real `/models`, `/chat/completions`, usage logs, or verified user flows. Also use it for Chinese requests such as “误报”, “false-negative”, “状态语义不一致”, “provider_status 不准”, “last_probe_status 错误”, or “真实数据面可用但后台还显示失败”.

False-Negative Status Triage

This skill is for signal reconciliation.

Problem pattern:

  • real request path works
  • status projection still says degraded, broken, or failed

Treat this as a modeling problem first, not as an outage.

Four-layer comparison

Always compare these layers side by side:

  1. import batch result
  2. provider snapshot or aggregate status
  3. provider account inventory status
  4. real data-plane evidence

Do not jump directly from import noise to a user-facing conclusion.

Meaning of each layer

Keep these separate:

  • batch_status: did every import-time check pass?
  • provider_status: is the provider actually usable at provider level?
  • account_status: what should operators believe about a specific account asset?
  • last_probe_status: what happened in the last probe or normalized diagnostic view?

These should not all collapse to the same string.
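
A minimal sketch of keeping the four signals as separate typed fields, so that one cannot silently overwrite another. The enum classes, their member lists, and the `StatusView` name are assumptions made for illustration; only the four field names and values such as `partial`, `active`, `broken`, `gateway_ready`, and `warning` come from this document.

```python
from dataclasses import dataclass
from enum import Enum


class BatchStatus(Enum):
    # Did every import-time check pass? (member list is hypothetical)
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILED = "failed"
    ROLLED_BACK = "rolled_back"


class ProviderStatus(Enum):
    # Is the provider usable at provider level?
    ACTIVE = "active"
    DEGRADED = "degraded"
    BROKEN = "broken"


class AccountStatus(Enum):
    # What operators should believe about one account asset.
    ACTIVE = "active"
    WARNING = "warning"
    BROKEN = "broken"


class ProbeStatus(Enum):
    # Result of the last probe or its normalized diagnostic view.
    OK = "ok"
    GATEWAY_READY = "gateway_ready"
    WARNING = "warning"
    FAILED = "failed"


@dataclass(frozen=True)
class StatusView:
    batch_status: BatchStatus
    provider_status: ProviderStatus
    account_status: AccountStatus
    last_probe_status: ProbeStatus


# A partial batch does not force every other layer to read "partial".
view = StatusView(
    batch_status=BatchStatus.PARTIAL,
    provider_status=ProviderStatus.ACTIVE,
    account_status=AccountStatus.ACTIVE,
    last_probe_status=ProbeStatus.GATEWAY_READY,
)
```

Using a distinct enum per layer makes an accidental assignment of a batch value to a provider field a type error rather than a silent string copy.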

Preferred source of truth

When real user traffic and probe-only signals disagree:

  • trust real data-plane success over probe-only failure
  • trust host usage_logs over display counters when available
  • for provider-level availability, trust access closure readiness over a single noisy account probe
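
The precedence above can be sketched as a small resolver. The `Evidence` record and both function names are hypothetical. Treating a real data-plane failure as "not usable" is an added assumption, since the list only states what to do when real traffic succeeds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    # True/False if real /models and /chat/completions were exercised,
    # None if no real-traffic evidence was collected.
    data_plane_ok: Optional[bool]
    # Row count from host usage_logs, None if the logs are unavailable.
    usage_log_count: Optional[int]
    display_counter: int
    access_closure_ready: bool
    # Recorded for diagnostics; it does not decide provider-level state.
    account_probe_ok: bool


def provider_usable(e: Evidence) -> bool:
    # Real data-plane evidence outranks any probe-only signal.
    if e.data_plane_ok is not None:
        return e.data_plane_ok
    # Without real traffic, access closure readiness outranks a single
    # noisy account probe.
    return e.access_closure_ready


def request_count(e: Evidence) -> int:
    # Prefer usage_logs over display counters when the logs are available.
    if e.usage_log_count is not None:
        return e.usage_log_count
    return e.display_counter
```

Note that `account_probe_ok` is deliberately ignored by `provider_usable`: a failed probe on one account is kept as a diagnostic, not as a verdict on the provider.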

Normalization strategy

Use a narrow rule rather than promoting everything.

Good example:

  • batch is partial
  • access closure is ready
  • only one imported account resource exists
  • smoke model is actually present
  • raw account probe failed

In this case:

  • provider-level state can be active
  • account inventory may be normalized away from broken
  • probe display can become gateway_ready or warning

The point is to remove false negatives without hiding real breakage.
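
As an illustration, the narrow rule can be written as a single conjunction that returns normalized values only when all five conditions hold. `ImportContext`, `normalize`, and the string values are hypothetical names for this sketch. The choice of `active` for the account and `gateway_ready` for the probe display is one of the options the list above allows, not the only one.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ImportContext:
    batch_status: str
    access_closure_ready: bool
    imported_account_count: int
    smoke_model_present: bool
    raw_probe_status: str


def normalize(ctx: ImportContext) -> Optional[Dict[str, str]]:
    """Return normalized statuses for the narrow case, else None."""
    narrow_case = (
        ctx.batch_status == "partial"
        and ctx.access_closure_ready
        and ctx.imported_account_count == 1
        and ctx.smoke_model_present
        and ctx.raw_probe_status == "failed"
    )
    if not narrow_case:
        # Leave every status exactly as it was derived.
        return None
    return {
        "provider_status": "active",
        # Any non-broken value would satisfy the rule; "active" is assumed.
        "account_status": "active",
        "last_probe_status": "gateway_ready",
    }
```

Returning `None` outside the narrow case keeps the rule from promoting anything by default: callers have to handle the unnormalized path explicitly.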

What must remain strict

Do not normalize away these cases:

  • strict import failures
  • rolled back batches
  • broken access closure
  • missing smoke model
  • multi-account scenarios where one account may really be bad

This skill is about reducing noise, not erasing legitimate failures.
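
One way to keep these cases strict is a guard that lists every reason normalization must be refused, checked before any promotion logic runs. `GuardInput` and `blocking_reasons` are hypothetical names, and the boolean fields are assumed to be derivable from the batch and provider records.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GuardInput:
    strict_import_failed: bool
    batch_rolled_back: bool
    access_closure_ready: bool
    smoke_model_present: bool
    imported_account_count: int


def blocking_reasons(g: GuardInput) -> List[str]:
    """Collect every reason that forbids normalizing the status."""
    reasons: List[str] = []
    if g.strict_import_failed:
        reasons.append("strict import failure")
    if g.batch_rolled_back:
        reasons.append("batch rolled back")
    if not g.access_closure_ready:
        reasons.append("broken access closure")
    if not g.smoke_model_present:
        reasons.append("missing smoke model")
    if g.imported_account_count > 1:
        reasons.append("multiple accounts: one may really be bad")
    return reasons
```

An empty list means no guardrail objects. A non-empty list can be logged as is, which keeps the refusal explainable to operators.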

Fix workflow

1. Reproduce the disagreement

Capture:

  • provider snapshot
  • provider account inventory row
  • real /models result
  • real /chat/completions result
  • usage log evidence if possible
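
A minimal sketch for capturing all of these in one timestamped record, so the layers can be compared side by side. `capture_evidence` is a hypothetical helper; the fetch callables are placeholders for whatever client or query reads each source in your environment.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def capture_evidence(sources: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Call each fetcher once and record its value or its error."""
    record: Dict[str, Any] = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    for name, fetch in sources.items():
        try:
            record[name] = {"ok": True, "value": fetch()}
        except Exception as exc:  # keep going: a failure is evidence too
            record[name] = {"ok": False, "error": repr(exc)}
    return record


def failing_probe() -> Any:
    # Stand-in for a probe that raises; replace with a real call.
    raise RuntimeError("probe timed out")


# Stub fetchers for illustration only; the returned values are made up.
record = capture_evidence({
    "provider_snapshot": lambda: {"provider_status": "degraded"},
    "models": lambda: ["example-model"],
    "account_probe": failing_probe,
})
```

Recording errors instead of raising means a single unreachable source does not prevent the rest of the comparison from being captured.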

2. Identify the wrong abstraction boundary

Typical causes:

  • provider status derived too directly from batch partiality
  • account inventory mirrors raw probe status instead of normalized availability
  • advisory or transient probe failure treated as definitive breakage

3. Add tests first

Write regression tests for:

  • provider-level promotion when access is truly ready
  • account-level normalization only in the intended narrow scenario
  • guardrails that keep real broken cases broken
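
The three kinds of regression test might look like the following `unittest` sketch. The two `derive_*` functions are simplified, hypothetical stand-ins for the real status derivation; in practice the tests would import the project's own functions.

```python
import unittest


def derive_provider_status(batch_status, access_closure_ready, smoke_model_present):
    # Hypothetical stand-in for the provider-level derivation.
    if batch_status in ("failed", "rolled_back"):
        return "broken"
    if not access_closure_ready or not smoke_model_present:
        return "broken"
    return "active"


def derive_account_status(raw_probe_status, provider_status, account_count):
    # Hypothetical stand-in for the account-level derivation.
    if raw_probe_status != "failed":
        return "active"
    if provider_status == "active" and account_count == 1:
        return "active"
    return "broken"


class StatusRegressionTests(unittest.TestCase):
    def test_provider_promoted_when_access_ready(self):
        self.assertEqual(derive_provider_status("partial", True, True), "active")

    def test_account_normalized_only_in_narrow_scenario(self):
        self.assertEqual(derive_account_status("failed", "active", 1), "active")
        # Multi-account: a failed probe may be a real failure.
        self.assertEqual(derive_account_status("failed", "active", 2), "broken")

    def test_real_breakage_stays_broken(self):
        self.assertEqual(derive_provider_status("rolled_back", True, True), "broken")
        self.assertEqual(derive_provider_status("partial", False, True), "broken")
        self.assertEqual(derive_provider_status("partial", True, False), "broken")
        self.assertEqual(derive_account_status("failed", "broken", 1), "broken")
```

Run with `python -m unittest` against the module that contains the class. The guardrail test is the one most likely to catch an over-broad normalization later.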

4. Change semantics minimally

  • keep raw batch detail truthful
  • normalize higher-level status only where it improves operational meaning
  • avoid changing unrelated enums or broad behavior

5. Verify on a live sample

Re-read the same provider and account in a real environment after deployment.

You want to see:

  • raw batch still truthful
  • aggregate provider state corrected
  • account inventory corrected
  • real request path still working
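
These four expectations can be checked mechanically against a before and after reading of the same provider and account. `verify_fix` and the dictionary keys are hypothetical; the sketch assumes the corrected provider state is `active` and that any non-`broken` account state counts as corrected.

```python
from typing import Dict, List


def verify_fix(before: Dict[str, object], after: Dict[str, object]) -> List[str]:
    """Return the list of expectations that the live sample violates."""
    problems: List[str] = []
    # Raw batch detail must not be rewritten by the fix.
    if after["batch_status"] != before["batch_status"]:
        problems.append("raw batch detail changed")
    if after["provider_status"] != "active":
        problems.append("aggregate provider state not corrected")
    if after["account_status"] == "broken":
        problems.append("account inventory not corrected")
    if not after["data_plane_ok"]:
        problems.append("real request path no longer working")
    return problems


# Illustrative readings only; the values are made up for the sketch.
before = {"batch_status": "partial", "provider_status": "degraded",
          "account_status": "broken", "data_plane_ok": True}
after = {"batch_status": "partial", "provider_status": "active",
         "account_status": "active", "data_plane_ok": True}
problems = verify_fix(before, after)
```

An empty result means the live sample matches all four expectations. Any entry points at the layer that still needs attention.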