Essay & Technical Narrative

The Babysitting Crisis: Why AI-Native Companies Can't Scale.

A technical argument for formal coordination protocols, durable semantic structures, and why the "autonomous agent" is a myth in production.

By Nikita Nosov, Altynai Akylbekova, Nick Mitushin & Ulugbek Zayniev

Every founder racing to build an "AI-native" business is currently confronting a silent, expensive crisis.

We are told that AI agents are ready to act as digital workers. We buy individual agents, specialized co-pilots, and prompt-based pipelines. We expect seamless autonomous leverage. What we get instead is AI-assisted chaos.

"The dirty secret of multi-step AI workflows is that they don't actually scale. Without a rigorous, software-enforced boundary, unconstrained agents spend their time hallucinating parameters, inventing API calls, and silently degrading. The human is never freed; they are simply demoted to babysitting agents and fixing JSON format errors at 2 AM."

If required data, system registries, tools, credentials, or schemas are missing or offline, standard frameworks either degrade silently or retry in an endless loop, burning through massive token budgets while pretending everything is working. This is a failure at the trust boundary: the operator can no longer tell whether the system is actually doing its job.
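Both failure modes can be avoided by checking dependencies before any work starts and by capping retries with an explicit budget. The sketch below is illustrative only: `run_with_budget`, `MissingDependency`, `RetryBudgetExceeded`, and the choice of `ValueError` as the retryable error are hypothetical names and assumptions, not part of Taskforce.

```python
from __future__ import annotations

from typing import Callable, Iterable


class MissingDependency(RuntimeError):
    """Raised when a required tool, credential, or schema is unavailable."""


class RetryBudgetExceeded(RuntimeError):
    """Raised when a step keeps failing after the allowed number of attempts."""


def run_with_budget(
    step: Callable[[], str],
    required: Iterable[str],
    available: set[str],
    max_attempts: int = 3,
) -> str:
    # Fail loudly before spending any tokens if a dependency is missing.
    missing = [name for name in required if name not in available]
    if missing:
        raise MissingDependency(f"missing dependencies: {missing}")

    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            return step()
        except ValueError as exc:  # assumption: ValueError marks a retryable failure
            last_error = exc
    # The budget is spent: surface the failure instead of looping forever.
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts") from last_error
```

The point of the design is that every exit is explicit: the step either returns a result, reports a missing dependency, or reports an exhausted budget.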

We believe that companies will not scale by adding more uncoordinated agents. They need a coordination protocol that constrains AI labor safely, transparently, and autonomously.

The goal is not to build "one very smart autonomous agent"—the goal is to build a system where humans define the "what," while the coordination runtime safely owns the "how." We call this the L2 Supervisor and L3 Specialized Worker architecture, coordinated under an immutable Playbook contract.

The Dilemma: Improvisation vs. Contract

Below is a side-by-side look at how the same task, parsing startup metrics from a PDF, fails under a standard, improvisational agent system and how it behaves under Taskforce's bounded contracts.

01 / The Uncontrolled Route

Improvised loops and guesses.

1. Tool Improvisation [Guessing]

The agent is given access to a search API. Lacking boundaries, it guesses search parameters ("depth: extreme"), making dozens of unnecessary, expensive queries.

2. Factual Hallucination [Unchecked]

It extracts metrics from a pitch deck PDF, confusing Gross Merchandise Value with actual Annual Recurring Revenue. It writes the incorrect ARR ($2.4M) directly to the database.

3. Silent Crashing [Format Crash]

The model returns output wrapped in raw markdown fences (```json). The backend JSON parser throws a syntax exception and crashes at 2 AM.

4. Uncontrolled Action [Bounced]

It automatically drafts and sends a cold outreach email with the fake metrics. The message bounces, triggering spam blocks and blacklisting your corporate email.

02 / The Coordinated Route

Playbook constraints and safety.

1. Bounded Execution [Secured]

The playbook defines exactly which tools are authorized. The L2 Supervisor isolates the environment and allows only calls whose parameters validate against the tool's API.

2. Factual Grounding [Verified]

A structured schema validates types. An automated L3 Judge checks assertions against the PDF source. When it finds a mismatch, the grounding score falls and execution holds.

3. Autonomous Self-Repair [Healed]

The L2 Supervisor catches the parsing error, dispatches a patch prompt, and extracts clean, validated JSON in 120ms without human intervention.

4. Human Approval Gate [Operator Gate]

The playbook lists outreach as an irreversible strategic action. The runner pauses, requests an Operator digital signature, and dispatches the safe email only upon review.
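Two of the coordinated steps above, format self-repair and the operator gate, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Taskforce's implementation: `parse_model_json`, `dispatch`, and the action name `send_outreach_email` are hypothetical. The repair shown here is a purely local fence strip, whereas the runtime described above dispatches a patch prompt.

```python
from __future__ import annotations

import json
import re

# Built by repetition so this example contains no literal fence marker.
_TICKS = "`" * 3
_FENCE = re.compile(
    r"^" + _TICKS + r"[a-zA-Z]*[ \t]*\n(.*?)\n?" + _TICKS + r"\s*$",
    re.DOTALL,
)


def parse_model_json(raw: str) -> dict:
    """Parse model output, removing a surrounding markdown fence if present."""
    text = raw.strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1)
    data = json.loads(text)  # still raises json.JSONDecodeError on real garbage
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


# Assumption: actions the playbook marks as irreversible.
IRREVERSIBLE_ACTIONS = {"send_outreach_email"}


def dispatch(action: str, approved_by: str | None = None) -> str:
    """Hold irreversible actions until an operator has approved them."""
    if action in IRREVERSIBLE_ACTIONS and not approved_by:
        return "held_for_approval"
    return "dispatched"
```

The parser tolerates a recoverable formatting slip but still fails loudly on anything it cannot parse, and the gate holds an irreversible action until an operator is named.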

The Bounded Primitive: Playbook Contracts

Wherever possible, Taskforce avoids letting LLMs operate on raw natural-language prompts. Instead, it encapsulates workflows inside structured Playbook Contracts.

This turns ad-hoc delegation into a rigorous, verifiable system. Self-improvement in Taskforce is a safe ladder: we don't let agents loosely rewrite their own code. Instead, we evolve the contract briefs, worker sandboxes, and grader rules. Below is an interactive schema visualizer mapping how a compiled playbook locks execution constraints across our decentralized registries.

Playbook Contract Explorer

Schema Visualizer

playbook: "abrt-vc/sourcing-pipeline"
version: "v1.4.2"
task_type: information_gathering

inputs_schema:
  pitch_deck_url: string(format=uri)
  target_stage: enum[pre-seed, seed, series-a]

allowed_workers:
  - l3_pdf_reader: deterministic
  - l3_metrics_extractor: agentic

eval_gates:
  - grader: l3_factual_grounding_check
    constraints: confidence > 0.90
    on_fail: trigger_repair_loop

safety_constraints:
  - outreach_gate: require_human_signature
    allow_retry: false
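To make the contract concrete, here is a rough sketch of how a runner might enforce the `inputs_schema` and the `eval_gates` threshold from the playbook above. The function names are hypothetical. Treating `format=uri` as "has a scheme and a host" is a simplifying assumption, not the playbook compiler's actual rule.

```python
from __future__ import annotations

from urllib.parse import urlparse

ALLOWED_STAGES = {"pre-seed", "seed", "series-a"}  # from the target_stage enum
MIN_CONFIDENCE = 0.90  # from the eval gate constraint


def validate_inputs(inputs: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the inputs pass."""
    errors = []
    url = urlparse(str(inputs.get("pitch_deck_url", "")))
    if not (url.scheme and url.netloc):
        errors.append("pitch_deck_url must be an absolute URI")
    if inputs.get("target_stage") not in ALLOWED_STAGES:
        errors.append("target_stage must be one of " + ", ".join(sorted(ALLOWED_STAGES)))
    return errors


def eval_gate(confidence: float) -> str:
    """Pass strictly above the threshold; otherwise take the on_fail route."""
    return "pass" if confidence > MIN_CONFIDENCE else "trigger_repair_loop"
```

Because the constraint reads `confidence > 0.90`, a score of exactly 0.90 does not pass in this sketch.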

Contract Registry

Immutable Playbook Metadata

Playbook contracts are compiled into our versioned system database. Once deployed, a contract version is immutable: the L2 Supervisor protects execution paths from emergent prompt drift, silent behavior regressions, and unauthorized code changes.
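One common way to make a versioned contract tamper-evident is to store a content hash per version and refuse to re-register a version with different content. The sketch below assumes that approach. `ContractRegistry` and `fingerprint` are hypothetical names, and the in-memory dictionary stands in for the system database.

```python
from __future__ import annotations

import hashlib
import json


def fingerprint(contract: dict) -> str:
    """Stable SHA-256 digest of a contract, independent of key order."""
    canonical = json.dumps(contract, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ContractRegistry:
    """Append-only store keyed by (name, version); illustrative and in-memory."""

    def __init__(self) -> None:
        self._digests: dict[tuple[str, str], str] = {}

    def register(self, name: str, version: str, contract: dict) -> str:
        key = (name, version)
        digest = fingerprint(contract)
        existing = self._digests.get(key)
        if existing is not None and existing != digest:
            # A deployed version can never be silently replaced.
            raise ValueError(f"{name}@{version} is already registered with different content")
        self._digests[key] = digest
        return digest

    def verify(self, name: str, version: str, contract: dict) -> bool:
        """True only if the contract matches what was registered for that version."""
        return self._digests.get((name, version)) == fingerprint(contract)
```

A supervisor could call `verify` before each run and halt if the loaded playbook no longer matches its registered digest.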

Compiler Insight

"Guarantees 100% legibility and absolute execution tracking for audit trails."

The Memory Engine: Modular Agent Memory

Standard agent setups aggregate session chat histories, prompt logs, and temporary variables into a single context window. As the workflow runs, this unstructured context bloat leads to noisy recall and silent failures.

Taskforce solves this by structuring memory into four distinct, specialized layers inspired by human cognitive science:

• Working Memory [Short-Lived]

The active context of the current task. It is lightweight, transient, and automatically cleared when a job completes to prevent context pollution.

• Episodic Memory [Traces & Runs]

Durable logs of execution traces, outcomes, errors, retries, and strategic decisions. Serves as the raw audit trail that L2 uses to learn and trigger self-repair.

• Procedural Memory [Skills & Rules]

Immutable skills, playbooks, execution runbooks, constraints, and prompt templates. Loaded directly from the Taskforce Hub as strict behavior guidelines.

• Semantic Memory [Durable Truths]

Stable, long-term knowledge about projects, tools, and rules. It persists across sessions to ensure a worker in session #5 remembers the context of session #1.
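The four layers above differ mainly in lifetime and mutability, which a small data structure can make explicit. This is a sketch under assumptions: the `AgentMemory` class and its field types are hypothetical and only illustrate the separation, not how Taskforce stores anything.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Any, Mapping


@dataclass
class AgentMemory:
    # Working memory: scratch state for the current job only.
    working: dict[str, Any] = field(default_factory=dict)
    # Episodic memory: append-only trace of what happened in each run.
    episodic: list[dict[str, Any]] = field(default_factory=list)
    # Procedural memory: read-only rules and playbooks.
    procedural: Mapping[str, Any] = field(default_factory=lambda: MappingProxyType({}))
    # Semantic memory: durable facts that persist across sessions.
    semantic: dict[str, Any] = field(default_factory=dict)

    def finish_job(self, job_id: str, outcome: str) -> None:
        """Record the outcome in the episodic log, then clear working memory."""
        self.episodic.append({"job_id": job_id, "outcome": outcome})
        self.working.clear()
```

Wrapping procedural memory in `MappingProxyType` means a worker that tries to rewrite a rule gets a `TypeError` instead of silently changing behavior.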

Specifically for the Semantic Memory layer, we designed a standalone, bio-inspired engine called CLARK Semantic Memory. Kept fully separate from raw session chat history, it links durable project parameters into a knowledge graph:

Bio-Inspired Retrieval (Clark's Nutcracker)

CLARK is named after Clark's Nutcracker (Nucifraga columbiana), a bird known for exceptional long-term spatial memory: it caches 30,000+ seeds across a wide territory and recovers many of them up to 9 months later.

Instead of querying raw vectors, the CLARK memory engine mimics the bird's precision through a structured 3-stage retrieval loop:

1. Value Iteration

Like seasonal hippocampal growth in the bird, it propagates confidence dynamically across connected nodes in the semantic graph.

2. A* Search

Like landmark-based navigation, it selects optimal entities by PageRank × confidence, scoring with cosine similarity + temporal bonus.

3. Confidence Boost

Like neurogenesis, retrieved facts get a +0.05 confidence reinforcement, and their immediate neighbors receive +0.017.
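The scoring and reinforcement rules in the three stages above can be written down directly. The sketch below uses the constants from the text (+0.05 and +0.017) and the stated formulas (PageRank × confidence, cosine similarity + temporal bonus). Everything else is an assumption: the function names, the cap at 1.0, and the temporal bonus arriving as a precomputed number. Value iteration and the A* traversal themselves are not shown.

```python
from __future__ import annotations

import math

RETRIEVED_BOOST = 0.05  # from the text
NEIGHBOR_BOOST = 0.017  # from the text


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def priority(pagerank: float, confidence: float) -> float:
    """Entity selection weight: PageRank times confidence."""
    return pagerank * confidence


def score(query_vec: list[float], node_vec: list[float], temporal_bonus: float) -> float:
    """Relevance score: cosine similarity plus a temporal bonus."""
    return cosine(query_vec, node_vec) + temporal_bonus


def reinforce(confidence: dict[str, float], edges: dict[str, set[str]], retrieved: str) -> None:
    """Boost the retrieved node and its immediate neighbors (capped at 1.0 by assumption)."""
    confidence[retrieved] = min(1.0, confidence[retrieved] + RETRIEVED_BOOST)
    for neighbor in edges.get(retrieved, set()):
        confidence[neighbor] = min(1.0, confidence[neighbor] + NEIGHBOR_BOOST)
```

For example, retrieving a node at confidence 0.50 raises it to 0.55, and each directly connected node at 0.50 rises to 0.517.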

Because CLARK organizes stable truths into a database of nodes and edges, the coordination layer always knows exactly what rules apply to which run. If a safety rule like "Never overwrite user-owned files without approval" is registered, CLARK guarantees it spreads confidently to every worker's context, preventing silent accidents.

Case Study: Startup Sourcing at ABRT VC

Taskforce is not just an abstract architecture. We built the prototype to solve our partners' workflows.

Our primary design partner is ABRT AI Lab / ABRT VC, where we are building an AI-native venture fund operating system. Sourcing startups, research, metrics extraction, diligence, founder support, and investor matching are highly complex, multi-step tasks. Letting LLMs roam freely through these critical pipelines, as a typical agent system does, is not viable.

Below is a live, interactive simulator representing a real Taskforce Sourcing Run. Read through the steps, observe the self-healing formatting repair, and act as the Human Operator to authorize the strategic final deal sheet dispatch.

abrt-vc/playbooks/sourcing_pipeline.yaml

Workflow Steps

Fetch Pitch Decks

Metrics Extraction

Factual Grounding Check

Self-Repair Formats

Strategic Sign-off Gate

Supervisor Log Stream

latency: 120ms

[10:20:01] [L2_SUPERVISOR] Initializing VC Sourcing Run...

→ Loading registry schemas from ABRT VC hub.

✔ Playbook locked. L3 specialized workers registered.

The Economics of Digital Labor

We will make money as B2B software for companies that run recurring work through AI agents. The initial model is a subscription for the coordination runtime, plus usage-based pricing for agent execution and managed enterprise deployments.

Early pricing will be $1k–$5k/month for startups and small teams running a few important workflows, and $25k–$250k/year for larger companies, funds, and operations-heavy businesses that need custom playbooks, private deployment, governance, integrations, and approval controls.

The market is much larger than the current "AI agents" software category. Salesforce is a $41B+ revenue company for managing customer workflows^[1], ServiceNow is a $12B+ subscription revenue company for managing enterprise workflows^[2], and Workday is a $9B+ revenue company for managing people and finance workflows^[3]. If AI workers become a new labor layer inside companies, they will need their own operating layer.

A realistic path to $1B ARR is 20,000 companies paying an average of $50k/year, or 5,000 larger customers paying $200k/year. If Taskforce becomes the coordination layer for AI-native work, the upside is multi-billion-dollar ARR in an enterprise agentic market expanding over 46% annually^[4].

[ References & Market Data Sources ]

[1] Salesforce FY2026 Results: Delivers record $41.5B revenue, highlighting 2.4B agentic work units and FY2030 target of $63B revenue.

[2] ServiceNow FY2025 Results: Reports $12.9B subscription revenue with FY2026 subscription revenue guidance of $15.5B+.

[3] Workday FY2026 Results: Reports $9.55B revenue, serving 11,500+ enterprise customers, positioning as platform for people, money, and agents.

[4] Grand View Research: Enterprise Agentic AI Market Report sizing market at $2.58B in 2024, projected to expand to $24.5B by 2030 (46.2% CAGR).