Most AI companies rent intelligence and compete at the UI. We ship the control points that turn intelligence into owned, verifiable work, in the open.
Agents don't fail for lack of intelligence. They fail when knowledge is stale, retrieval is expensive, output is unverified, and capability never reaches a surface users can own. So we build every stage of the line, not one layer of it:
| # | Control point | Why it matters | Open source |
|---|---|---|---|
| 01 | Capability Foundry: create capability, not wrappers | If you only rent frontier APIs, your ceiling is someone else's roadmap | II-Medical · II-Search · II-Thought |
| 02 | Governed Context: turn raw knowledge into machine-usable supply | Agents fail when knowledge is scattered, stale, or outside source boundaries | II-Commons · II-Commons-Skills |
| 03 | Retrieval Fabric: make search cheap, local, inspectable | Context is useless if agents can't search before every decision, tool call, or handoff | psql_bm25s |
| 04 | Work Harness: completion under gates, not just generation | Agent output isn't work until it survives validators, evidence review, and replanning | II-Agent · II-Researcher · Zenith |
| 05 | Owned Surfaces: land capability where work happens | Capability only compounds when people can run, fork, and extend it | CommonGround · CG-Cardbox · opencode-a2a |
Models can be rented. UIs can be copied. Control points compound.
The clearest proof that the harness layer matters: on the independent Frontier SWE benchmark, GPT-5.5 running inside Zenith ranks #1 overall, ahead of every frontier model paired with its own native harness. The identical model on its native harness ranks #5. Same model, better control loop.
| # | Model | Harness | Avg rank ↓ | Dominance ↑ |
|---|---|---|---|---|
| 1 | GPT-5.5 | 🥇 Zenith | 2.06 | 92% |
| 2 | Claude Fable | Claude Code | 2.71 | 88% |
| 3 | Claude Opus 4.8 | Claude Code | 5.06 | 71% |
| 4 | GLM-5.2 | Claude Code | 5.31 | 69% |
| 5 | GPT-5.5 | Codex (native) | 5.53 | 68% |
Metrics as reported by the Frontier SWE leaderboard. The full 15-entry table is in the Zenith results.
Zenith is our continuous-improvement harness for missions that run for days or weeks, where the dominant failure mode is premature completion. One orchestrator session reads task state each turn and decides whether to spawn workers and testers, register reusable skills, replan, or stop, all over MCP/ACP on top of Claude Code, Codex, or Hermes. In our published ablation across eight long-horizon tasks, Zenith achieves the best mean rank at less than half of RALPH's per-task cost ($176 vs $408).
📄 Technical report: From RALPH to Zenith: Designing Harnesses for Long-Running Agents
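The per-turn decision described above can be sketched as a small policy function. This is an illustrative sketch only, not Zenith's actual implementation: the `TaskState` fields, the `Action` names, the `stall_limit` threshold, and the ordering of the checks are all assumptions made for the example. The point it tries to show is that stopping is the last option considered, which is one way to guard against premature completion.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    """Hypothetical action set mirroring the choices named in the text."""
    SPAWN_WORKER = auto()
    SPAWN_TESTER = auto()
    REGISTER_SKILL = auto()
    REPLAN = auto()
    STOP = auto()


@dataclass
class TaskState:
    """Hypothetical snapshot the orchestrator reads at the start of a turn."""
    open_subtasks: int = 0
    untested_changes: int = 0
    failing_checks: int = 0
    repeated_procedure_seen: bool = False
    turns_without_progress: int = 0


def decide(state: TaskState, stall_limit: int = 3) -> Action:
    """Pick one action for this turn; STOP is only reachable when nothing is pending."""
    # A stalled mission gets a new plan before more workers are spent on it.
    if state.turns_without_progress >= stall_limit:
        return Action.REPLAN
    # Failing checks or open subtasks mean there is still work to do.
    if state.failing_checks > 0 or state.open_subtasks > 0:
        return Action.SPAWN_WORKER
    # Finished but unverified changes go to a tester, not straight to "done".
    if state.untested_changes > 0:
        return Action.SPAWN_TESTER
    # A procedure that keeps recurring is worth saving as a reusable skill.
    if state.repeated_procedure_seen:
        return Action.REGISTER_SKILL
    return Action.STOP
```

In this sketch, `decide(TaskState(untested_changes=1))` returns `Action.SPAWN_TESTER` rather than `Action.STOP`: work that has been written but not checked does not count as complete.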
| Project | What it does |
|---|---|
| II-Agent | Open general agent framework: browser, code, files, sandboxed execution, documents, slides, multi-model routing |
| Zenith | #1 on Frontier SWE. A continuous-improvement harness for long-running agent tasks that turns Claude Code, Codex, or Hermes into a multi-agent mission orchestrator via MCP/ACP |
| II-Researcher | Deep-research agent: query decomposition, search generation, context compression, self-critique, and cited reports. Scores 84.1 on FRAMES |
| CommonGround | From isolated agents to shared work: records, evidence, handoffs, and decisions that persist beyond one run |
| psql_bm25s | Postgres-native exact BM25: mutable indexes, crash recovery, replication-friendly storage, SQL-native permissions |
| II-Commons | The knowledge supply chain: Wikipedia, PD12M, arXiv, and PubMed, parsed, embedded, indexed, and served with provenance |
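For readers unfamiliar with the "exact BM25" mentioned in the psql_bm25s row, here is a minimal sketch of textbook Okapi BM25 scoring in plain Python. It is not psql_bm25s code and says nothing about how that extension stores or queries its index. The function name `bm25_scores`, the whitespace tokenizer, the defaults `k1=1.2` and `b=0.75`, and the `log(1 + ...)` IDF variant are all assumptions chosen for illustration. "Exact" here means every document is scored with the full formula, with no approximation or pruning.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    if not docs:
        return []
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant; rarer terms weigh more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "postgres bm25 index",
    "agent harness replanning",
    "bm25 bm25 search",
]
scores = bm25_scores("bm25", docs)
```

With these three documents of equal length, the one that repeats the query term scores highest, the one that mentions it once scores lower, and the one that never mentions it scores zero.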
Everything is on the 🤗 Hugging Face Hub, with weights, data, and benchmark traces included.
| Release | Type | Highlight |
|---|---|---|
| II-Medical-8B | Model | Specialist medical reasoning with SFT, RL, and safety stages |
| II-Search-4B | Model | Multi-hop search and tool-use behavior in a small model |
| II-Thought-RL-v0 | Dataset | 341,795 verified, machine-checkable RL problems across math, code, science, medicine |
| II-Medical-Reasoning-SFT | Dataset | Part of the 2.2M medical reasoning rows behind the II-Medical series |
| wikipedia_en · arxiv · pd12m | Datasets | Public knowledge, processed for agents, with citations and source boundaries |
| 🏆 #1 | ⭐ 5,000+ | 🧪 341K | 🏥 2.2M | 🤗 9 + 20 | 🏭 5/5 |
|---|---|---|---|---|---|
| on Frontier SWE (Zenith) | GitHub stars across the org | verified RL problems, open | medical reasoning rows | open models + datasets | production-line stages shipped, all open |
We publish the research, the data pipelines, the retrieval infrastructure, the harnesses, and the philosophy, because an intelligence economy only compounds when its production line is inspectable and forkable. Our long-form thesis lives at Symbioism: A Third Path for the Intelligence Age (source, naturally).
Earlier experiments like CoT-Lab, Common Chronicle, and CommonGround-legacy are archived in public. Every stage of the line started as an open experiment; the ones that worked became infrastructure.
ii.inc · Blog · 🤗 Hugging Face · Symbioism