feat(specs): backfill and refresh revision summaries
GitHub issue · Closed
Summary
- Refresh open spec pages when the async revision summary worker stores an agent-written summary.
- Fall back to a plain Condukt run when structured revision summary output is not submitted, so current revisions can still get useful summaries.
- Show an explicit pending message while agent summaries are enabled instead of presenting the additions/removals fallback as final.
- Add a periodic revision summary sweeper that backfills missing summaries by spawning one worker per revision, and sanitize worker errors before Oban stores them.
- Update Tuist production to Fireworks’ current Kimi K2.7 Code model through ReqLLM’s native provider: fireworks_ai:accounts/fireworks/models/kimi-k2p7-code.
- Scale the Hive production worker pool from 2 to 4 workers so new app pods can run on fresh egress IPs that Fireworks accepts.
- Persist the CNPG WAL archive safety-check override required for Hive’s existing dedicated backup path after the production timeline change.
Production diagnosis
- Production has 31 eligible revisions with 0 stored summaries, and recent revision summary jobs were discarded after Fireworks returned HTTP 403.
- Models.dev and Fireworks list Kimi K2.7 Code under accounts/fireworks/models/kimi-k2p7-code; the configured Hive Fireworks key can call it successfully outside the cluster.
- The pod’s HIVE_LLM_API_KEY matches the current 1Password item by hash, so this is not secret drift.
- The original Hive worker IPs 178.105.102.177 and 178.105.115.239 return Fireworks’ HTML HTTP 403 before API-key authentication, including with no Authorization header.
- Atlas production can call the same Fireworks route and accounts/fireworks/models/kimi-k2p7-code successfully from its app pods, so the issue was specific to Hive’s original cluster egress IPs.
- I scaled Hive’s CAPI worker pool to 4 workers. The new worker IPs 138.199.154.38 and 167.233.74.44 return Fireworks’ normal JSON HTTP 401 without auth, confirming they are not blocked at the edge.
- I verified an authenticated production call from the new worker with accounts/fireworks/models/kimi-k2p7-code; Fireworks returned HTTP 200.
- I cordoned the two blocked workers and restarted deploy/hive; the two live Hive app pods are now split across the two new accepted workers.
- hive-postgres-1 was failing because CNPG was blocked by barman-cloud-check-wal-archive returning "Expected empty archive" after a timeline change on the existing dedicated archive path. I applied cnpg.io/skipEmptyWalArchiveCheck=enabled, let the WAL backlog drain, moved both Postgres instances to the fresh workers, and verified that a manual backup completed.
- The currently deployed app still has the pre-PR openai:accounts/fireworks/models/kimi-k2p5 config until this branch is merged and deployed, but a live Hive.Agents.Sessions call now succeeds from production after the egress rotation.
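The edge-versus-API distinction used in the diagnosis above (an HTML 403 with no Authorization header versus a JSON 401) can be sketched as a small shell helper. This is an illustrative sketch, not the script that was actually run: classify_probe and its verdict names are hypothetical, and the URL in the usage comment is an assumed Fireworks endpoint.

```shell
#!/usr/bin/env bash
# classify_probe STATUS CONTENT_TYPE
# Hypothetical helper: maps the result of an unauthenticated probe to a verdict.
classify_probe() {
  local status="$1" content_type="$2"
  case "$status" in
    401)
      # The API answered and asked for credentials, so the edge let the request through.
      echo "reached_api" ;;
    403)
      case "$content_type" in
        # An HTML 403 without any Authorization header suggests a block before API auth.
        text/html*) echo "edge_blocked" ;;
        *) echo "forbidden_by_api" ;;
      esac ;;
    2??) echo "ok" ;;
    *) echo "unknown" ;;
  esac
}

# Usage from the node or pod under test (the endpoint is an assumption):
#   read -r status ctype < <(curl -s -o /dev/null \
#     -w '%{http_code} %{content_type}\n' \
#     https://api.fireworks.ai/inference/v1/models)
#   classify_probe "$status" "$ctype"
```

Under this sketch, the original workers would classify as edge_blocked and the new workers as reached_api, which matches the observations listed above.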
Testing
- mix test test/hive/specs/revision_summary_sweeper_test.exs test/hive/specs/revision_summary_worker_test.exs test/hive/specs/revision_summaries_test.exs test/hive_web/live/spec_live/show_test.exs
- mix compile --warnings-as-errors
- mix test test/hive/specs/revision_summary_worker_test.exs
- helm template hive infra/helm/hive -f infra/helm/hive/values-production.yaml
- KUBECONFIG=/Users/pepicrft/.kube/tuist-mgmt.yaml kubectl apply --dry-run=server -f infra/k8s/cluster-production.yaml
- Production Fireworks probes from old and new worker nodes, plus one live Hive.Agents.Sessions call through hive rpc.
- Production CNPG remediation: verified hive-postgres healthy with 2 ready instances, zero queued WALs, streaming replication active, and a completed hive-postgres-manual-20260619 backup.