Hive
fix: bound spec-write lock waits and harden agent runs against slow LLMs
GitHub issue · Closed
Two independent production reliability fixes that surfaced from the same incident triage.
1. Spec writes can’t hang on a row lock
What
Bound how long any DB connection can hold a row lock, via Hive.Repo connection parameters (config/runtime.exs):
- TCP keepalives (tcp_keepalives_idle/interval/count): Postgres detects a dead/severed client and closes the connection, rolling back its transaction and releasing its locks.
- idle_in_transaction_session_timeout (60s): Postgres rolls back a transaction abandoned mid-write so it can’t keep holding the row lock.
Why
update_spec was hanging indefinitely in production. The optimistic-lock UPDATE specs … WHERE id = $1 AND lock_version = $2 must take a row lock on the existing spec row. That row was held by a stuck/abandoned connection — a request died (or its connection was severed) mid-write, leaving the server-side transaction idle-in-transaction — and Hive had no keepalives or idle timeout configured, so nothing ever reaped it and the wait was infinite.
The symptom pattern fit exactly:
- Reads work: they take no lock.
- create_spec works: it inserts a new row, never contending with the locked one.
- An invalid update fast-errors: Ecto short-circuits an invalid changeset before issuing SQL, so it never reaches the locked UPDATE.
- A valid update_spec hangs forever: it blocks acquiring the existing row’s lock.
With keepalives plus idle_in_transaction_session_timeout, an orphaned lock holder is reaped within roughly the idle timeout, after which the waiting write proceeds and succeeds. This mirrors tuist/server, which sets the same kind of connection parameters and deliberately does not special-case lock contention: ordinary concurrent edits are already handled by the optimistic lock_version (a fast stale_revision), so there’s no bespoke error path. It also protects every write path, not just specs.
Changes
- config/runtime.exs: add parameters: [tcp_keepalives_*, idle_in_transaction_session_timeout] to the prod Hive.Repo config.
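A minimal sketch of what this change might look like. The :hive OTP app name and the three keepalive values are assumptions for illustration; only the 60s idle-in-transaction timeout comes from the description above. Postgrex sends the :parameters keyword list to Postgres when the connection starts, so the values are written as strings.

```elixir
# config/runtime.exs (sketch, not the actual Hive config)
import Config

if config_env() == :prod do
  # :hive is an assumed OTP app name.
  config :hive, Hive.Repo,
    parameters: [
      # Keepalive values below are illustrative, not taken from the PR.
      # Seconds of inactivity before the server sends the first keepalive probe.
      tcp_keepalives_idle: "60",
      # Seconds between unanswered probes.
      tcp_keepalives_interval: "5",
      # Unanswered probes before the server considers the client dead.
      tcp_keepalives_count: "3",
      # Milliseconds a session may sit idle inside an open transaction
      # before Postgres terminates it (60s, as stated above).
      idle_in_transaction_session_timeout: "60000"
    ]
end
```

With values like these, a severed client would be detected after roughly idle + interval × count seconds, while a live but abandoned transaction is cut off by the idle-in-transaction timeout.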
Verification
- Confirmed against production: the orphaned lock had already cleared, spec #69 was lockable, and a live update_spec on #69 completed instantly and bumped it to revision 2.
Note on history: an earlier commit on this branch implemented this with per-transaction SET LOCAL lock_timeout and an {:error, :locked} surfaced through the MCP tools and LiveViews. That was reverted in favor of the connection-level approach above, which is simpler, global, and matches the reference codebase’s convention.
2. A slow LLM can’t starve the agent HTTP pool (Sentry HIVE-2)
What
Bound each agent run and give the agent HTTP pool headroom, so a slow model degrades gracefully instead of exhausting connections.
- lib/hive/agents/sessions.ex: cap every Condukt run at 2 minutes (was Condukt’s 5-minute default; callers can still override :timeout).
- config/config.exs: raise ReqLLM’s ReqLLM.Finch pool from the default 8 connections (stream_pool_count 8 → 16).
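The "cap with caller override" behavior can be sketched as a default merged into the run options. The module and function names below are hypothetical and are not Hive's actual Sessions code; the sketch only assumes that run options are a keyword list carrying a :timeout in milliseconds.

```elixir
defmodule SessionsTimeoutSketch do
  # Hypothetical helper for illustration only.
  # 2 minutes in milliseconds, replacing the 5-minute default mentioned above.
  @default_timeout :timer.minutes(2)

  # Adds the 2-minute timeout unless the caller already passed :timeout.
  def with_default_timeout(opts) when is_list(opts) do
    Keyword.put_new(opts, :timeout, @default_timeout)
  end
end
```

Using Keyword.put_new rather than Keyword.put is what keeps an explicit caller-supplied :timeout intact.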
Why
Every agent run (revision summaries, issue triage, domain evolution) makes its LLM calls through ReqLLM’s own ReqLLM.Finch pool, which defaults to 8 connections per host. The configured model is a large, slow one, so under load — amplified by ReqLLM’s max_retries: 3, Oban max_attempts: 3, and the 15-min summary sweeper — those 8 slots starve and new calls fail with “Finch was unable to provide a connection” (HIVE-2), cascading into agent session timeouts (HIVE-A / HIVE-4).
This is isolated from SQL: every agent worker reads, releases its DB connection, makes the LLM call, then writes — the model call holds an HTTP connection, never an Ecto/DB one. So the blast radius is agent features only, never the dashboard or spec writes.
The 2-minute cap frees an agents-queue worker quickly instead of letting a stalled run linger for 5 minutes; the larger pool absorbs retry/teardown churn. Real concurrency is bounded by the :agents queue, so the extra connections add no endpoint load.
Not in this PR (follow-ups)
- Per-task model tiers (root cause of the slowness). Every agent shares one model (HIVE_LLM_MODEL), so the heavyweight code model runs even trivial background jobs: revision-diff summaries, issue/drop classification, domain evolution. Follow-up: split into tiers, with a fast, light model for the background structured agents and a stronger model only for the user-facing SlackConversationAgent. Sessions.run/run_operation already merges a per-call model: override, so this is mainly config plumbing (e.g. a HIVE_LLM_MODEL_FAST) plus each agent declaring its tier. This is the real cure; the two changes above just keep a slow model from taking the agent pool down with it.
- Each LLM call also emits an IO.warn-with-stacktrace because the model isn’t in ReqLLM’s catalog (“unverified model”). This is log/Sentry noise, not a failure; it needs an inline model spec to silence.
- Hive.Agents.Tools.FetchUrlContent uses the default Req.Finch pool (shared with Slack/GitHub/RSS); routing it through the agent pool would fully isolate agent HTTP.
- A no-op spec update returns a misleading spec_id has already been taken (Ecto skips the lock_version bump on an empty changeset). Tracked separately.
Verification
- mix test: 741 tests, 0 failures.
- mix format --check-formatted and mix credo clean.