Agent runtimes,
written in Go.
We build the infrastructure that lets LLM agents run reliably: durable execution, structured output, multi-provider streaming, and guardrails, in a single binary with zero core dependencies.
Thesis: runtime, not API
Production AI does not fail because the model is too small. It fails because the runtime around the model is an afterthought.
Most agent frameworks treat durable execution, type-safe tools, and strict structured output as "enterprise features": bolted on later, if at all. We treat them as load-bearing.
We write in Go. We ship single binaries. Our agents run for days, crash, resume where they left off, and tell you exactly what they spent doing it.
The memory layer is still research. The runtime is already in production.
Production-grade agent framework for Go. Type-safe agents with generics. Strict structured output validated at the schema boundary. Multi-provider streaming (Anthropic, OpenAI, Vertex AI). Native Temporal integration for durable multi-day runs. Multi-agent orchestration. Guardrails. MCP client and server. Zero core dependencies. Full reference, runnable examples, and a live agent-builder playground at gollem.fugue-labs.ai.
- core: typed agents, structured output, tool calls, streaming
- ext/temporal: durable execution across workflow restarts
- ext/codetool: full coding-agent toolset (edit/grep/bash/LSP)
- ext/team: multi-agent orchestration with handoff filters
- ext/mcp: MCP client, server, SSE, sampling bridge
- ext/deep: long-running agents with checkpointed context
- ext/monty: Python execution via WASM, no CGO
- ext/agui: stream agent UI events over SSE
```go
// From gollem.nvim/internal/sidecar: a coding agent with the full codetool
// toolset, human-in-the-loop approval, tool events bussed to the editor.
agent := core.NewAgent[string](model, append(
	codetool.AgentOptions(cwd), // edit, grep, bash, LSP, read, write
	core.WithToolApproval[string](approvalFn),
	core.WithEventBus[KafkaMessage](bus), // tool events → editor UI
	core.WithRunCondition[string](core.MaxRunDuration(24 * time.Hour)),
)...)

stream, _ := agent.RunStream(ctx, "remove deprecated call sites")
defer stream.Close()

// Go 1.23+ iterators. Debounced text deltas repaint the assistant pane.
for text, err := range streamutil.StreamTextDebounced(stream, 50*time.Millisecond) {
	if err != nil { return err }
	editor.SetAssistantPane(text)
}
```
```go
// From brainrot-detection: typed structured output, validated at the schema boundary.
type Classification struct {
	IsBrainrot bool   `json:"is_brainrot"`
	Confidence string `json:"confidence" jsonschema:"enum=high|medium|low"`
	Reason     string `json:"reason"`
}

agent := core.NewAgent[Classification](provider,
	core.WithSystemPrompt[Classification](
		"You are a parental content filter for a 6-year-old..."),
)

res, _ := agent.Run(ctx, "Title: Skibidi Toilet Ep. 73\nChannel: ...")
if res.Output.IsBrainrot {
	sonos.PlayWarning() // +30% volume over the soundbar
	time.Sleep(10 * time.Second)
	if stillBrainrot() { tv.PowerOff() } // LG WebOS → off
}
```
```go
// Streaming: one unified iterator across Anthropic / OpenAI / Vertex.
stream, _ := agent.RunStream(ctx, "Write a haiku about goroutines.")

for ev, err := range stream.StreamEvents() {
	if err != nil { continue }
	d, ok := ev.(core.PartDeltaEvent)
	if !ok { continue }
	switch x := d.Delta.(type) {
	case core.TextPartDelta:
		fmt.Print(x.ContentDelta)
	case core.ToolCallPartDelta:
		log.Printf("args=%s", x.ArgsJSONDelta)
	}
}
```
```
sidecar │ session     sess-9f2c model=claude-sonnet-4-6
sidecar │ tool_call   grep("Deprecated:")                      42ms
sidecar │ tool_result 37 matches across 9 files                118ms
sidecar │ tool_call   edit(src/api.go:142)                     89ms
sidecar │ approval    bash("go vet ./...") → ok                1.2s
sidecar │ stream      "removing 3 call sites in api.go..."
sidecar │ usage       in=3412 out=1208 cost=$0.0287
```
Sleepy: your code evolves while you don't
LLM-guided evolution, hosted as an MCP server. Point it at anything with a fitness signal: a hot function and a benchmark, a system prompt and an eval set, an agent topology and a task suite, an infrastructure config and a staging score. Sleepy runs populations, selection, crossover, and convergence detection until the winner beats the baseline. In the background. While you work on something else.
The twist: the server owns zero intelligence. Mutations come from your LLM subscription over MCP sampling; the evaluator (tests, benchmark, staging) runs on your machine. The server orchestrates evolution but never holds your keys and never runs your code. Every candidate is gated end-to-end by the tests you already have.
- targets: functions, prompts, agent graphs, configs, multi-file refactors
- languages: Go, Python, Rust, JavaScript, C++, Zig, Java
- evolve: populations, MAP-Elites, islands, selection, crossover
- mutation: LLM-driven SEARCH/REPLACE diffs with a fuzz-tested parser
- reward-hacking gate: tests must pass before a benchmark is recorded
- hosted mode: serve+worker+watch, SQLite or Postgres, durable pause/resume
- providers: Anthropic, OpenAI, Ollama, Codex CLI, or zero-config
```
claude

> make my codebase faster

⎿ analyze_project(path=".")                              1.8s
  3 targets · benchmark evaluator
  1. ext/codetool/middleware.go         95%  parsing/walk hot path
  2. ext/orchestrator/sqlite/store.go   88%  DB/store overhead
  3. ext/codetool/bash.go               81%  buffer-heavy exec

Three heaviest paths: parsing in codetool, sqlite store overhead,
shell exec buffers. Starting evolution on all three, chasing the
biggest wins first.

⎿ start_evolution(max_gen=20, targets=[3 items])

─── evolution ──────────────────────────────────────────────
Gen 1   LLM-mutate   middleware.go   8471 ns/op    1.21×
Gen 2   LLM-mutate   middleware.go   5103 ns/op    2.01×
Gen 5   crossover    store.go        1984 ns/op    5.18×
Gen 9   LLM-mutate   middleware.go    612 ns/op   16.8×
Gen 14  MAP-Elites   bash.go          287 ns/op   35.9×  tests pass ✓

1 ◉ middleware.go     gen 14/20  35.9×  ██████████░░
2 ◉ sqlite/store.go   gen 11/20  18.4×  ████████░░░░
3 ◉ bash.go           gen  9/20   9.2×  ██████░░░░░░

EVOLVING   p pause · d diff · q stop      chat 7.5k→552 tok · $4.21
```
Research: open problems
We work on the systems research problems that only show up once your agents have been running for a thousand hours. These are the ones we're actively investing in.
Evaluator robustness under evolutionary pressure. Sleepy depends on the evaluator being harder to game than the mutator is to make clever. As LLM-driven mutation gets sharper, fitness functions get gamed in ways the author didn't anticipate. The reward-hacking gate is a floor, not a ceiling. We're building tooling to fuzz evaluators adversarially before the mutator finds the exploit you didn't.
Long-horizon context continuity. Running a coherent agent for 24 hours is a memory management problem, a prompt-cache problem, and a context-window problem at the same time. Naive solutions blow the cache; sophisticated solutions lose coherence. We're working on memory injection and eviction strategies that preserve cache hits across multi-hour sessions without breaking the agent's mental model.
Deterministic replay of agent runs. If you can replay a multi-day run deterministically, you can differential-test framework changes, A/B prompts against a fixed trace, and answer "what did the agent do, and why" in a way regulated environments actually accept. This is a runtime problem first and a research problem second. Most frameworks weren't built to make it possible.
Agent topology search. The shape of a multi-agent system (which tools, how many sub-agents, what the handoff graph looks like) is almost always hand-designed. Sleepy plus a topology search space turns "what's the right architecture for this task" from an architect's guess into a measurable optimization problem. Early work. The substrate is Sleepy.