AK

Case Study

Forge — LLM Batch & Realtime Orchestration

Provider-agnostic AI job infrastructure for batch and realtime workloads across OpenAI and Anthropic.

Built a crash-safe worker system where two Node daemons coordinate entirely through Postgres leases, SQL transitions, and transactional outboxes.

AI Infrastructure, LLM Infrastructure, Worker Orchestration / Provider Integration

Two cooperating Node daemons coordinate LLM batch and realtime jobs through Postgres only. Most AI apps stop at calling an API; Forge handles the operational layer — batching, provider routing, leases, retries, crash recovery, quota recovery, validation, notifications, and observability.

Role: System Architect / Full-Stack AI Infrastructure Developer
Team: Solo project
Timeline: Ongoing independent project
Industry: LLM Infrastructure

What we were solving

Context & problem

LLM workloads need both batch and realtime execution, provider flexibility, retry safety, cost awareness, validation, and operational visibility. A queue-based architecture would cover dispatch, but it adds a broker to run and a second source of state to keep consistent with the database.

Forge explores a Postgres-centered architecture where workers coordinate through SQL state transitions, leases, and transactional outboxes — no queue, no IPC, no shared worker processes.

How we approached it

Solution

Anvil (the builder) claims eligible pending rows and shapes them into provider-compatible batches. Hammer (the executor) advances those batches through worker loops that submit, poll, collect, execute realtime jobs, recover stale leases, archive cooled batches, stamp updates, recover quota, and send notifications.
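
As a rough illustration of the batch-shaping step, the sketch below groups claimed rows by provider and splits each group at a per-provider request limit. The `PendingJob` and `ShapedBatch` shapes, the `shapeBatches` name, and the limits are all assumptions made for this example; Anvil's actual shaping rules are not described in this case study.

```typescript
// Hypothetical row and batch shapes; the real Forge schema is not shown here.
type Provider = "openai" | "anthropic";

interface PendingJob {
  id: number;
  provider: Provider;
  payload: string;
}

interface ShapedBatch {
  provider: Provider;
  jobIds: number[];
}

// Group claimed jobs by provider, then split each group into chunks that
// respect an assumed per-provider request limit.
function shapeBatches(
  jobs: readonly PendingJob[],
  maxPerBatch: Record<Provider, number>,
): ShapedBatch[] {
  const idsByProvider = new Map<Provider, number[]>();
  for (const job of jobs) {
    const ids = idsByProvider.get(job.provider) ?? [];
    ids.push(job.id);
    idsByProvider.set(job.provider, ids);
  }

  const batches: ShapedBatch[] = [];
  idsByProvider.forEach((ids, provider) => {
    const limit = maxPerBatch[provider];
    if (!(limit >= 1)) throw new Error(`invalid batch limit for ${provider}`);
    for (let start = 0; start < ids.length; start += limit) {
      batches.push({ provider, jobIds: ids.slice(start, start + limit) });
    }
  });
  return batches;
}
```

With a limit of 2 for OpenAI, three OpenAI jobs and one Anthropic job come out as three batches: two for OpenAI and one for Anthropic.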

OpenAI and Anthropic ship through provider ports; the core execution model stays provider-agnostic. The two daemons share only Postgres — coordination lives entirely in SQL transitions, leases, and transactional outboxes.

Impact

Outcomes

  • Batch and realtime LLM execution paths in one architecture.
  • Worker coordination through Postgres without a separate queue service.
  • Lease expiry for crash recovery and stale worker reclamation.
  • OpenAI and Anthropic via provider ports.
  • Provider outputs revalidated against processor response schemas before success.
  • Operator-facing surfaces for health, metrics, archive, credentials, costs, and architecture.
  • Horizontal scaling through SKIP LOCKED claims and idempotent transitions.
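
The SKIP LOCKED claim mentioned in the last outcome can be sketched as a single parameterized statement. This is an illustration only: the table and column names (`jobs`, `status`, `lease_owner`, `lease_expires_at`) and the `buildClaimQuery` helper are assumptions, not Forge's schema. The returned object uses the `{ text, values }` shape that node-postgres accepts in `client.query()`.

```typescript
// Parameterized query in the { text, values } form accepted by node-postgres.
interface ClaimQuery {
  text: string;
  values: unknown[];
}

// Table and column names below are hypothetical, chosen for illustration.
function buildClaimQuery(
  workerId: string,
  batchSize: number,
  leaseSeconds: number,
): ClaimQuery {
  const text = `
    WITH candidates AS (
      SELECT id
      FROM jobs
      WHERE status = 'pending'
      ORDER BY id
      LIMIT $2
      FOR UPDATE SKIP LOCKED  -- rows locked by another worker are skipped, not waited on
    )
    UPDATE jobs AS j
    SET status = 'claimed',
        lease_owner = $1,
        lease_expires_at = now() + make_interval(secs => $3)
    FROM candidates AS c
    WHERE j.id = c.id
    RETURNING j.id`;
  return { text, values: [workerId, batchSize, leaseSeconds] };
}
```

Because each worker locks only rows nobody else holds, several instances can run the same statement concurrently without claiming the same job twice, which is what makes adding more workers safe.
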
[Screenshot: Architecture] Forge architecture: Telix seeds demon_jobs, Anvil batches, Hammer executes through provider ports and transactional outboxes
[Screenshot: Worker fleet] Forge worker fleet: Anvil, Striker, Bellows, Tongs, Spark, Quench, Cinder, Brand, Stoke, Beacon
[Screenshot: State machine] Forge batch and realtime state machine transitions
[Screenshot: Prompt Platform v2] Forge Prompt Platform v2: processors, tags, schemas, rendering, validation, output policy

Behind the scenes

Tech & delivery

Stack

  • Node.js
  • TypeScript
  • PostgreSQL
  • OpenAI Batch API
  • Anthropic Batch API
  • NestJS
  • Prometheus
  • React + Vite

Challenges

  • Coordinating two daemons and nine worker loops through Postgres alone — SQL state transitions, leases, and SKIP LOCKED claiming instead of a queue service.
  • Keeping side effects consistent with state flips via transactional outboxes drained by dedicated workers.
  • Validating provider outputs fail-closed against processor response schemas before accepting success.
  • Designing provider ports so OpenAI and Anthropic batch semantics stay behind one execution model.

How I worked

  • Lease-based crash recovery and idempotent claiming make every daemon and worker safe to scale horizontally.
  • Prometheus metrics, health endpoints, and a local operator viewer expose jobs, batches, costs, and live architecture.
  • A quota recovery control loop pauses exhausted provider keys and probes them back into service.

How it holds together

Technical highlights

Postgres-only coordination

No queue, IPC, or shared process. State lives in SQL transitions, leases, and transactional outboxes.
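
One way to picture an SQL state transition is as a guarded update that only fires when the row is still in the expected state. The sketch below models that in memory; the status names, the `ALLOWED` table, and the `transition` helper are assumptions for illustration, and the real state machine has more states than shown.

```typescript
type BatchStatus = "pending" | "submitted" | "completed" | "archived";

// Assumed legal transitions, simplified for this example.
const ALLOWED: Record<BatchStatus, BatchStatus[]> = {
  pending: ["submitted"],
  submitted: ["completed"],
  completed: ["archived"],
  archived: [],
};

interface BatchRow {
  id: number;
  status: BatchStatus;
}

// In-memory stand-in for a guarded update such as:
//   UPDATE batches SET status = $to WHERE id = $id AND status = $from
// Returns true only if the row was in `from`, so replaying the call is a no-op.
function transition(row: BatchRow, from: BatchStatus, to: BatchStatus): boolean {
  if (ALLOWED[from].indexOf(to) === -1) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  if (row.status !== from) return false;
  row.status = to;
  return true;
}
```

Calling `transition(row, "pending", "submitted")` twice returns `true` and then `false`, so a worker that retries after a crash cannot advance the same row twice.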

Provider-agnostic batch execution

OpenAI and Anthropic ship in-tree through a BatchProviderPort registry.
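
A port registry of this kind could look roughly like the following. Only the name `BatchProviderPort` comes from the project; the method signatures, the `ProviderRegistry` class, and the `fakePort` stub are hypothetical and make no network calls.

```typescript
type BatchState = "in_progress" | "completed" | "failed";

// Hypothetical port shape: the core talks to this interface, never to a vendor SDK.
interface BatchProviderPort {
  readonly providerName: string;
  submit(requests: string[]): Promise<string>; // resolves to a provider batch id
  poll(batchId: string): Promise<BatchState>;
}

class ProviderRegistry {
  private readonly ports = new Map<string, BatchProviderPort>();

  register(port: BatchProviderPort): void {
    if (this.ports.has(port.providerName)) {
      throw new Error(`provider "${port.providerName}" is already registered`);
    }
    this.ports.set(port.providerName, port);
  }

  resolve(providerName: string): BatchProviderPort {
    const port = this.ports.get(providerName);
    if (!port) {
      throw new Error(`no provider port registered for "${providerName}"`);
    }
    return port;
  }
}

// Stub adapter used only to exercise the registry.
function fakePort(providerName: string): BatchProviderPort {
  return {
    providerName,
    submit: async (requests) => `${providerName}-batch-${requests.length}`,
    poll: async () => "completed",
  };
}
```

The executor would call `registry.resolve(batch.provider)` and work against the returned port, so adding a provider means registering one more adapter rather than changing the worker loops.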

Crash-safe workers

Leases, status transitions, recovery loops, idempotent claiming.
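
The lease rule behind crash recovery is small enough to show in memory: a job is claimable when nobody holds it or when the holder's lease has lapsed. The field names, worker ids, and `claimJobs` helper below are assumptions for illustration; in Forge the equivalent check runs in SQL.

```typescript
interface LeasedJob {
  id: number;
  leaseOwner: string | null;
  leaseExpiresAt: number; // epoch milliseconds; 0 when never leased
}

// A job can be claimed if it has no owner or its owner's lease has expired.
function isClaimable(job: LeasedJob, nowMs: number): boolean {
  return job.leaseOwner === null || job.leaseExpiresAt <= nowMs;
}

// Claim up to `limit` jobs for `workerId`, stamping a fresh lease on each.
function claimJobs(
  jobs: LeasedJob[],
  workerId: string,
  nowMs: number,
  ttlMs: number,
  limit: number,
): number[] {
  const claimed: number[] = [];
  for (const job of jobs) {
    if (claimed.length >= limit) break;
    if (!isClaimable(job, nowMs)) continue;
    job.leaseOwner = workerId;
    job.leaseExpiresAt = nowMs + ttlMs;
    claimed.push(job.id);
  }
  return claimed;
}
```

A worker that crashes simply stops renewing its lease. Once `leaseExpiresAt` passes, the next claim pass picks the job up again with no explicit failure signal needed.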

Transactional outboxes

Brand and notification side effects written in the same transaction as state flips.
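
The point of the outbox is that the state flip and the record of its side effect commit or roll back together. The sketch below imitates that with a copy-then-commit helper instead of a real Postgres transaction; the `Store` shape, the `batch.completed` topic, and the `failAfterFlip` switch are hypothetical.

```typescript
interface OutboxMessage {
  topic: string;
  batchId: number;
}

interface Store {
  batchStatus: Record<number, string>;
  outbox: OutboxMessage[];
}

// Minimal transaction stand-in: run `work` against a copy and publish the copy
// only if `work` returns normally. In Postgres this would be BEGIN ... COMMIT.
function withTransaction(store: Store, work: (tx: Store) => void): void {
  const tx: Store = {
    batchStatus: { ...store.batchStatus },
    outbox: store.outbox.slice(),
  };
  work(tx); // a throw here leaves `store` untouched
  store.batchStatus = tx.batchStatus;
  store.outbox = tx.outbox;
}

// Flip the batch to "completed" and enqueue its notification in one unit.
function completeBatch(store: Store, batchId: number, failAfterFlip = false): void {
  withTransaction(store, (tx) => {
    tx.batchStatus[batchId] = "completed";
    if (failAfterFlip) throw new Error("simulated crash before commit");
    tx.outbox.push({ topic: "batch.completed", batchId });
  });
}
```

If the simulated crash fires, neither the status change nor the outbox row is visible afterwards. A separate drain worker can then deliver whatever is in the outbox without ever seeing a state flip that lacks its message.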

Prompt Platform v2

Processor prompts, schemas, output policies, tags, validation, preview, promotion surfaces.

Fail-closed validation

Provider success lines revalidated against response schemas before acceptance.
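
Fail-closed here means a result counts as a success only if it can be proven valid; everything else is rejected. The sketch below uses a deliberately tiny, hand-rolled schema format (field name to primitive type) as a stand-in. The `ResponseSchema` type and `validateSuccessLine` function are assumptions, not the project's actual schema system.

```typescript
// Hypothetical schema format: each field maps to the primitive type it must have.
type FieldType = "string" | "number" | "boolean";
type ResponseSchema = Record<string, FieldType>;

type Verdict =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; reason: string };

// Reject unparseable JSON, non-objects, missing fields, wrong types,
// and fields the schema does not declare.
function validateSuccessLine(rawLine: string, schema: ResponseSchema): Verdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawLine);
  } catch {
    return { ok: false, reason: "not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "not a JSON object" };
  }
  const value = parsed as Record<string, unknown>;
  for (const key of Object.keys(schema)) {
    if (typeof value[key] !== schema[key]) {
      return { ok: false, reason: `field "${key}" is missing or has the wrong type` };
    }
  }
  for (const key of Object.keys(value)) {
    if (!Object.prototype.hasOwnProperty.call(schema, key)) {
      return { ok: false, reason: `unexpected field "${key}"` };
    }
  }
  return { ok: true, value };
}
```

A line the provider reported as successful but that fails this check would be treated as a failed job, not passed downstream.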

Quota recovery loop

Provider keys pause on quota exhaustion and recover via no-spend probe logic.
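
A pause-and-probe loop of this shape might be modeled as below. Everything here is an assumption for illustration: the `ProviderKey` fields, the delay constants, and the doubling backoff between failed probes are choices made for the example, and `probe` stands in for whatever no-spend check the real loop performs.

```typescript
interface ProviderKey {
  id: string;
  state: "active" | "paused";
  nextProbeAtMs: number;
  probeDelayMs: number;
}

const INITIAL_DELAY_MS = 60000; // assumed first wait before probing
const MAX_DELAY_MS = 3600000; // assumed cap on the wait between probes

// Called when a provider reports quota exhaustion for this key.
function pauseOnQuotaExhausted(key: ProviderKey, nowMs: number): void {
  key.state = "paused";
  key.probeDelayMs = INITIAL_DELAY_MS;
  key.nextProbeAtMs = nowMs + INITIAL_DELAY_MS;
}

// One tick of the recovery loop. `probe` is a hypothetical check that spends
// no quota and returns true once the key is usable again.
function runRecoveryTick(
  key: ProviderKey,
  nowMs: number,
  probe: (keyId: string) => boolean,
): void {
  if (key.state !== "paused" || nowMs < key.nextProbeAtMs) return;
  if (probe(key.id)) {
    key.state = "active";
    return;
  }
  // Probe failed: wait longer before the next attempt, up to the cap.
  key.probeDelayMs = Math.min(key.probeDelayMs * 2, MAX_DELAY_MS);
  key.nextProbeAtMs = nowMs + key.probeDelayMs;
}
```

Routing would skip any key whose state is `paused`, so an exhausted key stops receiving work until a probe succeeds.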

Operator UI

Local viewer: jobs, batches, archive, processors, credentials, costs, brand, live architecture.