Agent runtimes,
written in Go.
We build the infrastructure that lets LLM agents run reliably: durable execution, structured output, multi-provider streaming, and guardrails, in a single binary with zero core dependencies.
Thesis: runtime, not API
Production AI does not fail because the model is too small. It fails because the runtime around the model is an afterthought.
Most agent frameworks treat durable execution, type-safe tools, and strict structured output as "enterprise features": bolted on later, if at all. We treat them as load-bearing.
We write in Go. We ship single binaries. Our agents run for days, crash, resume where they left off, and tell you exactly what they spent doing it.
The memory layer is still research. The runtime is already in production.
Production-grade agent framework for Go. Type-safe agents with generics. Strict structured output validated at the schema boundary. Multi-provider streaming (Anthropic, OpenAI, Vertex AI). Native Temporal integration for durable multi-day runs. Multi-agent orchestration. Guardrails. MCP client and server. Zero core dependencies. Full reference, runnable examples, and a live agent-builder playground at gollem.fugue-labs.ai.
- core: typed agents, structured output, tool calls, streaming
- ext/temporal: durable execution across workflow restarts
- ext/codetool: full coding-agent toolset (edit/grep/bash/LSP)
- ext/team: multi-agent orchestration with handoff filters
- ext/mcp: MCP client, server, SSE, sampling bridge
- ext/deep: long-running agents with checkpointed context
- ext/monty: Python execution via WASM, no CGO
- ext/agui: stream agent UI events over SSE
```go
// From gollem.nvim/internal/sidecar: a coding agent with the full codetool
// toolset, human-in-the-loop approval, tool events bussed to the editor.
agent := core.NewAgent[string](model, append(
	codetool.AgentOptions(cwd), // edit, grep, bash, LSP, read, write
	core.WithToolApproval[string](approvalFn),
	core.WithEventBus[KafkaMessage](bus), // tool events → editor UI
	core.WithRunCondition[string](core.MaxRunDuration(24 * time.Hour)),
)...)

stream, _ := agent.RunStream(ctx, "remove deprecated call sites")
defer stream.Close()

// Go 1.23+ iterators. Debounced text deltas repaint the assistant pane.
for text, err := range streamutil.StreamTextDebounced(stream, 50*time.Millisecond) {
	if err != nil { return err }
	editor.SetAssistantPane(text)
}
```
```go
// From brainrot-detection: typed structured output, validated at the schema boundary.
type Classification struct {
	IsBrainrot bool   `json:"is_brainrot"`
	Confidence string `json:"confidence" jsonschema:"enum=high|medium|low"`
	Reason     string `json:"reason"`
}

agent := core.NewAgent[Classification](provider,
	core.WithSystemPrompt[Classification](
		"You are a parental content filter for a 6-year-old..."),
)

res, _ := agent.Run(ctx, "Title: Skibidi Toilet Ep. 73\nChannel: ...")
if res.Output.IsBrainrot {
	sonos.PlayWarning() // +30% volume over the soundbar
	time.Sleep(10 * time.Second)
	if stillBrainrot() { tv.PowerOff() } // LG WebOS → off
}
```
```go
// Streaming: one unified iterator across Anthropic / OpenAI / Vertex.
stream, _ := agent.RunStream(ctx, "Write a haiku about goroutines.")

for ev, err := range stream.StreamEvents() {
	if err != nil { continue }
	d, ok := ev.(core.PartDeltaEvent)
	if !ok { continue }
	switch x := d.Delta.(type) {
	case core.TextPartDelta:
		fmt.Print(x.ContentDelta)
	case core.ToolCallPartDelta:
		log.Printf("args=%s", x.ArgsJSONDelta)
	}
}
```
```
sidecar │ session     sess-9f2c model=claude-sonnet-4-6
sidecar │ tool_call   grep("Deprecated:")                      42ms
sidecar │ tool_result 37 matches across 9 files                118ms
sidecar │ tool_call   edit(src/api.go:142)                     89ms
sidecar │ approval    bash("go vet ./...") → ok                1.2s
sidecar │ stream      "removing 3 call sites in api.go..."
sidecar │ usage       in=3412 out=1208 cost=$0.0287
```
Sleepy: your code evolves while you don't
LLM-guided evolution, hosted as an MCP server. Point it at anything with a fitness signal: a hot function and a benchmark, a system prompt and an eval set, an agent topology and a task suite, an infrastructure config and a staging score. Sleepy runs populations, selection, crossover, and convergence detection until the winner beats the baseline. In the background. While you work on something else.
The twist: the server owns zero intelligence. Mutations come from your LLM subscription over MCP sampling; the evaluator (tests, benchmark, staging) runs on your machine. The server orchestrates evolution but never holds your keys and never runs your code. Every candidate is gated end-to-end by the tests you already have.
- targets: functions, prompts, agent graphs, configs, multi-file refactors
- languages: Go, Python, Rust, JavaScript, C++, Zig, Java
- evolve: populations, MAP-Elites, islands, selection, crossover
- mutation: LLM-driven SEARCH/REPLACE diffs with a fuzz-tested parser
- reward-hacking gate: tests must pass before a benchmark is recorded
- hosted mode: serve+worker+watch, SQLite or Postgres, durable pause/resume
- providers: Anthropic, OpenAI, Ollama, Codex CLI, or zero-config
```
claude

> make my codebase faster

⎿ analyze_project(path=".")                              1.8s
  3 targets · benchmark evaluator
  1. ext/codetool/middleware.go         95%  parsing/walk hot path
  2. ext/orchestrator/sqlite/store.go   88%  DB/store overhead
  3. ext/codetool/bash.go               81%  buffer-heavy exec

Three heaviest paths: parsing in codetool, sqlite store overhead,
shell exec buffers. Starting evolution on all three, chasing the
biggest wins first.

⎿ start_evolution(max_gen=20, targets=[3 items])

─── evolution ──────────────────────────────────────────────
Gen 1   LLM-mutate   middleware.go   8471 ns/op    1.21×
Gen 2   LLM-mutate   middleware.go   5103 ns/op    2.01×
Gen 5   crossover    store.go        1984 ns/op    5.18×
Gen 9   LLM-mutate   middleware.go    612 ns/op   16.8×
Gen 14  MAP-Elites   bash.go          287 ns/op   35.9×  tests pass ✓

1 ◉ middleware.go     gen 14/20  35.9×  ██████████░░
2 ◉ sqlite/store.go   gen 11/20  18.4×  ████████░░░░
3 ◉ bash.go           gen  9/20   9.2×  ██████░░░░░░

EVOLVING   p pause · d diff · q stop      chat 7.5k→552 tok · $4.21
```
Research: open problems
We work on the systems research problems that only show up once your agents have been running for a thousand hours. These are the ones we're actively investing in.
Evaluator robustness under evolutionary pressure. Sleepy depends on the evaluator being harder to game than the mutator is to make clever. As LLM-driven mutation gets sharper, fitness functions get gamed in ways the author didn't anticipate. The reward-hacking gate is a floor, not a ceiling. We're building tooling to fuzz evaluators adversarially before the mutator finds the exploit you didn't.
Long-horizon context continuity. Running a coherent agent for 24 hours is a memory management problem, a prompt-cache problem, and a context-window problem at the same time. Naive solutions blow the cache; sophisticated solutions lose coherence. We're working on memory injection and eviction strategies that preserve cache hits across multi-hour sessions without breaking the agent's mental model.
Deterministic replay of agent runs. If you can replay a multi-day run deterministically, you can differential-test framework changes, A/B prompts against a fixed trace, and answer "what did the agent do, and why" in a way regulated environments actually accept. This is a runtime problem first and a research problem second. Most frameworks weren't built to make it possible.
Agent topology search. The shape of a multi-agent system (which tools, how many sub-agents, what the handoff graph looks like) is almost always hand-designed. Sleepy plus a topology search space turns "what's the right architecture for this task" from an architect's guess into a measurable optimization problem. Early work. The substrate is Sleepy.