fix: bound spec-write lock waits and harden agent runs against slow LLMs

GitHub issue · Closed

Source: tuist/hive #84

Updated: Jun 25, 2026

Domains: Hive
Details

Two independent production reliability fixes that surfaced from the same incident triage.

1. Spec writes can’t hang on a row lock

What

Bound how long any DB connection can hold a row lock, via Hive.Repo connection parameters (config/runtime.exs):

TCP keepalives (tcp_keepalives_idle/interval/count): Postgres detects a dead or severed client and closes the connection, rolling back its transaction and releasing its locks.
idle_in_transaction_session_timeout (60s): Postgres terminates any session that sits idle inside an open transaction for more than 60s, rolling the transaction back so an abandoned write can’t keep holding the row lock.

Why

update_spec was hanging indefinitely in production. The optimistic-lock UPDATE specs … WHERE id = $1 AND lock_version = $2 must take a row lock on the existing spec row. That lock was held by a stuck, abandoned connection: a request died (or its connection was severed) mid-write, leaving the server-side transaction idle in transaction. Hive had no keepalives or idle timeout configured, so nothing ever reaped the holder and the wait was unbounded.

The symptom pattern fit exactly:

Reads work — they take no lock.
create_spec works — it inserts a new row, never contending with the locked one.
An invalid update fast-errors — Ecto short-circuits an invalid changeset before issuing SQL, so it never reaches the locked UPDATE.
A valid update_spec hangs forever — it blocks acquiring the existing row’s lock.

With keepalives and idle_in_transaction_session_timeout, an orphaned holder is reaped within roughly the idle timeout, after which the waiting write proceeds and succeeds. This mirrors tuist/server, which sets the same kind of connection parameters and deliberately does not special-case lock contention: ordinary concurrent edits are already handled by the optimistic lock_version (a fast stale_revision), so there’s no bespoke error path. It also protects every write path, not just specs.

Changes

config/runtime.exs — add parameters: [tcp_keepalives_*, idle_in_transaction_session_timeout] to the prod Hive.Repo config.
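For illustration, the change described above might look roughly like the following. This is a sketch, not the actual diff: the `:hive` OTP app name and the three keepalive values are assumptions (the description only gives the 60s idle timeout). Postgrex sends `:parameters` to the server as session settings at connect time, so the values are written as strings.

```elixir
# config/runtime.exs (sketch)
import Config

if config_env() == :prod do
  # `:hive` as the OTP app name is an assumption.
  config :hive, Hive.Repo,
    parameters: [
      # Assumed values: start probing after 60s of silence, probe every 5s,
      # and drop the connection after 3 unanswered probes.
      tcp_keepalives_idle: "60",
      tcp_keepalives_interval: "5",
      tcp_keepalives_count: "3",
      # From the description: end sessions left idle inside a transaction for 60s.
      idle_in_transaction_session_timeout: "60s"
    ]
end
```

Because these are connection parameters, they apply to every connection in the pool without any per-query or per-transaction code.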

Verification

Confirmed against production: the orphaned lock had already cleared, spec #69 was lockable, and a live update_spec on #69 completed instantly and bumped it to revision 2.

Note on history: an earlier commit on this branch implemented this with per-transaction SET LOCAL lock_timeout and an {:error, :locked} surfaced through the MCP tools and LiveViews. That was reverted in favor of the connection-level approach above — simpler, global, and matching the reference codebase’s convention.

2. A slow LLM can’t starve the agent HTTP pool (Sentry HIVE-2)

What

Bound each agent run and give the agent HTTP pool headroom, so a slow model degrades gracefully instead of exhausting connections.

lib/hive/agents/sessions.ex: cap every Condukt run at 2 minutes (down from Condukt’s 5-minute default; callers can still override :timeout).
config/config.exs: raise ReqLLM’s ReqLLM.Finch pool from its default of 8 connections to 16 (stream_pool_count 8 → 16).
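The "default cap that callers can override" idea from the first item can be sketched with plain OTP tasks. This is not Condukt's or Hive's implementation: `BoundedRun` is a hypothetical stand-in module, and only the 2-minute default and the `:timeout` option name come from the description above.

```elixir
defmodule BoundedRun do
  # Hypothetical stand-in for the run wrapper in lib/hive/agents/sessions.ex.
  @default_timeout :timer.minutes(2)

  # Runs `fun` in a task and gives up after `:timeout` milliseconds
  # (2 minutes unless the caller passes its own value).
  def run(fun, opts \\ []) when is_function(fun, 0) do
    timeout = Keyword.get(opts, :timeout, @default_timeout)
    task = Task.async(fun)

    # Task.yield/2 returns nil when the timeout elapses; Task.shutdown/2 then
    # stops the task so it cannot keep holding resources after we stop waiting.
    case Task.yield(task, timeout) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> {:ok, result}
      # Only reachable when the caller traps exits; otherwise a crash in the
      # linked task propagates to the caller.
      {:exit, reason} -> {:error, {:exit, reason}}
      nil -> {:error, :timeout}
    end
  end
end
```

With this sketch, `BoundedRun.run(fn -> 42 end)` returns `{:ok, 42}`, while `BoundedRun.run(fn -> Process.sleep(1_000) end, timeout: 50)` returns `{:error, :timeout}` and the sleeping task is killed instead of lingering.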

Why

Every agent run (revision summaries, issue triage, domain evolution) makes its LLM calls through ReqLLM’s own ReqLLM.Finch pool, which defaults to 8 connections per host. The configured model is a large, slow one, so under load — amplified by ReqLLM’s max_retries: 3, Oban max_attempts: 3, and the 15-min summary sweeper — those 8 slots starve and new calls fail with “Finch was unable to provide a connection” (HIVE-2), cascading into agent session timeouts (HIVE-A / HIVE-4).

This is isolated from SQL: every agent worker reads, releases its DB connection, makes the LLM call, then writes — the model call holds an HTTP connection, never an Ecto/DB one. So the blast radius is agent features only, never the dashboard or spec writes.

The 2-minute cap frees an agents-queue worker quickly instead of letting a stalled run linger for 5 minutes; the larger pool absorbs retry/teardown churn. Real concurrency is bounded by the :agents queue, so the extra connections add no endpoint load.

Not in this PR (follow-ups)

Per-task model tiers (root cause of the slowness). Every agent shares one model (HIVE_LLM_MODEL), so the heavyweight code model runs even trivial background jobs — revision-diff summaries, issue/drop classification, domain evolution. Follow-up: split into tiers — a fast, light model for the background structured agents and a stronger model only for the user-facing Slack ConversationAgent. Sessions.run/run_operation already merges a per-call model: override, so this is mainly config plumbing (e.g. a HIVE_LLM_MODEL_FAST) plus each agent declaring its tier. This is the real cure; the two changes above just keep a slow model from taking the agent pool down with it.
Each LLM call also emits an IO.warn-with-stacktrace because the model isn’t in ReqLLM’s catalog (“unverified model”) — log/Sentry noise, not a failure; needs an inline model spec to silence.
Hive.Agents.Tools.FetchUrlContent uses the default Req.Finch pool (shared with Slack/GitHub/RSS); routing it through the agent pool would fully isolate agent HTTP.
A no-op spec update returns a misleading “spec_id has already been taken” error (Ecto skips the lock_version bump on an empty changeset). Tracked separately.
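As a rough idea of the config plumbing the first follow-up refers to, the tier lookup could be as small as the sketch below. Everything here is hypothetical: the `ModelTiers` module, the `:fast` and `:strong` tier names, and the fallback behavior are illustrations, and `HIVE_LLM_MODEL_FAST` is only the example name floated in the follow-up.

```elixir
defmodule ModelTiers do
  # Hypothetical helper; `env` is injectable so the lookup can be exercised
  # without touching real environment variables.
  def model_for(tier, env \\ &System.get_env/1)

  # Background structured agents: prefer the fast model, and fall back to the
  # shared model when no fast one is configured.
  def model_for(:fast, env), do: env.("HIVE_LLM_MODEL_FAST") || env.("HIVE_LLM_MODEL")

  # User-facing conversation agent: always the shared (stronger) model.
  def model_for(:strong, env), do: env.("HIVE_LLM_MODEL")
end
```

Each agent would then declare its tier, and the resolved name would be passed as the per-call model: override that Sessions.run/run_operation already merges.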

Verification

mix test — 741 tests, 0 failures. mix format --check-formatted and mix credo clean.
