# Execution State — AINA Factory + PKM-to-VDS (2026-06-30)

Ali authorized full autonomous execution of the plan in
`aina-factory-map-and-academy-first-plan-2026-06-30.md`. Drive it to completion;
report progress; preservation-first on canonical PKM.

## VDS access
`cd /Users/Ali/PKM/scratch/pkm-deepdive && ./vds-exec.sh run '<cmd>'` (reads) / `runl` (writes). Host `aina-vds-tf`.

## The plan (two workstreams)
**A — Academy factory:** 1 drain dock (in flight, merge-train) · 2 merge queue (GATED on drain) · 3 containerize CI · 4 containerize lanes (MUST mount sessions/ out for PKM) · 5 validate one MCP agent in box · 6 replicate to platform+data-engine-room after 1 wk clean.
**B — Move PKM nightly off laptop:** 7 VDS producer in parallel (codex LLM, incremental) · 8 laptop→VDS session push · 9 wire outputs (hub/D1/notify from VDS, needs CF token) · 10 cut over, Mac=break-glass.

## In flight NOW (Codex lanes, gpt-5.5)
- **PKM groundwork** (PID 487923): clone pkm-monorepo→VDS, venv, map pipeline stages+side-effects, recommend isolated-build invocation. STOPS before producing. Report → `/srv/aina/ops/lanes/pkm-producer-groundwork-REPORT.md`; log `…/pkm-producer-groundwork.log`.
- **Docker base+CI** (PID 487924): build factory-base + aina-academy-ci images, validate academy CI in-container, write runner-flip runbook. Does NOT flip live runner. Report → `/srv/aina/docker/PHASE1-CI-REPORT.md`; log `/srv/aina/docker/docker-base-ci.log`.
- **Merge-train v3** (PID 229302): draining 28 open PRs (29→28→…) on free runner. Slow (serial CI).

## Verify-before-trust (lead owns)
- Read each lane's REPORT, don't trust the log's self-claim. Confirm: clone HEAD real, venv deps real, images actually built (`docker images`), CI steps actually passed in-container.
- PKM groundwork must NOT have produced/deployed/overwritten canonical — confirm `/home/ali/PKM/aliknowledgebank` untouched.

## Gates / unknowns
- Merge queue (step 2): enable only when in_review < ~10 / open PRs single digits.
- CF token for VDS-driven deploy (step 9): not found in /srv/aina-hq/.secrets; wrangler may be authed via env/token file — groundwork lane checks `wrangler whoami`.
- Step 4 session-log mount = the one way Docker could break PKM. Non-negotiable checklist item.

## Prereqs confirmed on VDS
python3.12 ✓ rsync ✓ wrangler ✓ node22 ✓ docker buildx 0.34.1 ✓ 67G free ✓. pkm-monorepo NOT present (groundwork clones it). pkm-agent consumer at /home/ali/Projects/pkm-agent.

## PROGRESS LOG
- 13:2x — Both groundwork lanes DONE + verified. PKM groundwork report (32-stage map, all side-effects flagged, isolated recipe, CF auth confirmed present) → lanes/pkm-groundwork-REPORT.md. Docker base+academy-ci images built (factory-base 1GB, academy-ci 2.15GB), CI passed in-container. Canonical snapshot untouched.
- 13:38 — Launched isolated PKM **staging build** (PID 584231) via /srv/aina/pkm-staging-build.sh → log /srv/aina/pkm-staging/build.log. Compute-only subset (transcripts.py all, gen_session_markdowns --all, build_artifact_bank, build_unified_pkm, brain_health) under HOME=/srv/aina/pkm-staging, casing symlinks, publisher env unset. NO deploy/sync/notify/LLM/Linear/snapshot/mirror. First Linux smoke test — expect path/casing/CLI shakeout.
- Dock: 30→23 open PRs (merge-train alive 229302). Board in_review still 56 (status lag — PR count is truth).
- CF auth for step 9: CONFIRMED present (wrangler OAuth ali@oscalar.com, D1+Pages write, ~/.wrangler/config/default.toml). Gap closed.

## Next actions when lanes finish
1. Verify both reports. 2. If Docker green → hold runner-flip until drain done. 3. If PKM groundwork green → I (lead) drive first isolated parallel build into /srv/aina/pkm-staging, verify no canonical touch, then create staging systemd timer. 4. When drain low → enable merge queue. 5. Build laptop→VDS push (Mac launchd). 6. Resolve CF auth → wire VDS outputs.

## MERGE QUEUE DECISION (Ali 2026-06-30): MERGIFY (not native), keep Team plan
Ali tired of GitHub/PRs; chose Mergify. Config drafted at scratch/pkm-deepdive/mergify/.mergify.yml (squash; queue conditions ci + local launch-path checks + PR bot-review watcher; bot-review = net, no human gate).
- **Ali's ONE step (only he can — app/permission grant): install Mergify app** → https://github.com/apps/mergify/installations/new → pick org ainative-academy → grant aina-academy. Low pressure (dock draining anyway).
- Then I: commit .mergify.yml, let Mergify validator confirm, ACTIVATE only when dock single-digits (now 19), retire release-marshal cron once proven.
- Check names: required="PR bot-review watcher"; also ci, "local launch-path checks". squash+rebase+merge all allowed.

## Docker GUARDRAIL SPEC (Ali's hard requirement — precondition on steps 4 & 5)
Ali's reason for staying disk-based = fear of agents running destructive docker/rm. Containment is architectural, enforce ALL when containerizing lanes/agents:
1. **No Docker socket in agent containers** — agents inside `--rm` boxes literally cannot run docker prune/rm/rmi. Only host-owned scripts touch docker.
2. **Precious data RO** — secrets, PKM store, repo checkout mounted read-only.
3. **Only writable mount = dedicated per-lane scratch** (`/srv/aina/agent-logs/<lane>/`), NEVER `~/.codex`/home. Blast radius = one lane's transient logs.
4. **Non-root container user** owning only its scratch dir.
5. **Cleanup = one filtered janitor** (dangling images + build cache > N days; NEVER `--volumes`, NEVER `-a`). Bind mounts are immune to prune anyway. Log dir in nightly backup set.
Detail in plan Part 7. The session-log mount (step 4) uses the dedicated scratch dir, RW, but narrow — reconciles with this spec.

## Done-means-Landed reminder
Plan doc is scratch (fine). Real changes (Dockerfiles, timers, Mac launchd, doc updates to CLAUDE.md/ARCHITECTURE.md) must be committed+pushed in their repos before "done".

## PROGRESS 14:3x
- linear-key rule RETIRED on platform: workflow 273125101 disabled_manually + removed from ruleset 16201915 (now 3 checks: validate, validate-canon, validate-founder-review-pack). Reversible.
- academy Mergify LIVE — auto-upgraded its own config format (#195 on main). Dock 30→14.
- platform PR #602 (Mergify config) OPEN, mergeable but BLOCKED on `validate` (pending on serial runner) + Workers Builds. Mergify already engaged (Merge Protections pass). MERGE WHEN validate GREEN to bootstrap platform queue.
- VISION-26 tracks platform rollout.
- NEXT TICK: merge #602 if validate green; watch academy dock toward single digits.

## PROGRESS ~14:50 (Ali at breakfast, ~1hr autonomy)
- SLACK NOISE FIXED: watchdog.sh now dedups Slack escalations (sig strips volatile numbers; only pings on CHANGED alert type). Was spamming every 15min on steady-state board:blocked=1 (AIN-215). Backup saved. agent-health escalation also deduped.
- ACADEMY FACTORY UP: ran 1 COO cycle → 6 lanes dispatched (QA go/no-go, second-verifier protocol, consent UI, AIN-100-B tests, AIN-97-T3 smoke, +1) in isolated worktrees. COO CRON RE-ENABLED (15 */2, next 16:15 EDT). Safe: isolated worktrees + Mergify lands green + academy required_conversation_resolution=FALSE (no bot-thread jam).
- DOCKER Phase 2: codex-lane:latest image BUILT (1.62GB). Lane writing PHASE2-LANE-REPORT.md + cutover runbook. VERIFY validation results (auth/loopback/session-mount) when done. Cutover (coo-ops-loop → docker run) is POST-validation; agents run native-isolated for now.
- MERGIFY: academy LIVE (landing PRs #184/#186). platform PR #602 = all 4 checks pass but BLOCKED on (a) Codex bot P1 unresolved thread (platform ruleset requires conversation resolution — academy doesn't) + (b) Mergify flagged my platform .mergify.yml uses DEPRECATED fields → needs format upgrade. TODO: resolve #602 bot thread + fix deprecated config.
- b3mvqzexh background task = the COO cycle SSH wrapper; it stays "running" because the 6 lanes hold the pipe. DO NOT TaskStop it (would SIGHUP the lanes).
- academy dock: 13 PRs draining (merge-train alive).

## PHASE 2 DOCKER: VALIDATED GREEN (~14:52)
codex-lane:latest works with ALL guardrails: subscription auth (no API key, gpt-5.5 OK), Paperclip loopback reachable inside container, host session JSONL written to /srv/aina/agent-logs/phase2-validation/codex/sessions (PKM capture survives!), no docker socket, repo RO, secrets RO, $HOME/~/.codex not mounted. Report: /srv/aina/docker/PHASE2-LANE-REPORT.md + runbook.
**CUTOVER NUANCE (lane caught it):** validation used RO repo mount, but real lanes commit/push/PR → need a WRITABLE per-lane worktree mounted (not RO repo). Handle this in the dispatcher cutover. DO NOT rush cutover unsupervised — test ONE container lane through full build→commit→push→PR→Mergify before flipping coo-ops-loop. Agents run native-isolated meanwhile (churn already solved by isolated worktrees).
## STATE FOR ALI'S RETURN: Slack quiet ✓ | factory UP (6 lanes + cron) ✓ | Mergify live academy ✓ | Docker built+validated ✓ | REMAINING: docker cutover (writable-wt + 1 e2e test), platform #602 (bot-thread + deprecated cfg)

## FULL FACTORY CYCLE SUCCESS (~15:03)
All 6 COO lanes completed end-to-end (NOT killed by SSH close — finished cleanly): AIN-190→#187(MERGED by Mergify), AIN-222→#186(MERGED), AIN-100-B→#196, AIN-231→#194, AIN-228→#189, AIN-232→#190, AIN-234→#191. Pipeline PROVEN: dispatch→isolated-worktree build→commit→push→PR→Mergify auto-merge. COO cron re-enabled sustains it (next 16:15). codex exec now 0 (cycle done); worktrees cleaned up. Open cycle PRs (#189/#190/#191/#194/#196) flow through Mergify as checks green. Loop monitoring via 15:17 wakeup (already scheduled).

## MERGIFY BUG FOUND + FIXED (~15:20) — was silently INERT
verify-don't-trust caught it: academy .mergify.yml required check-success=ci but NO check named "ci" exists (real: "local launch-path checks" + "PR bot-review watcher"). Mergify skipped EVERY PR; the 3 "merged" were the merge-train (oscalar admin), not Mergify. FIXED: corrected check names (commit a229de71→98c570f2). Also dropped deprecated delete_head_branch → enabled native delete_branch_on_merge on academy+platform. VERIFIED: #189/190/191/196 now QUEUED by Mergify (status pending, not skipping). Factory now truly self-merges via Mergify. Platform config check names (validate/canon/founder-review-pack) are real → platform fine; #602 still blocked on bot-thread (separate).

## REGRESSION CAUGHT + RESOLVED (~15:56) — Mergify queue was blocking the working merger
Chain of discovery: (1) Mergify inert (bad check name) → fixed names; (2) fixed names made Mergify's queue ENGAGE but it stalls (speculative draft #199 checks never green in window — academy's ci->workflow_run->bot-review chain doesn't complete on drafts); (3) Mergify then posts "Rule: auto-queue (queue)=FAIL" check → every PR UNSTABLE/UNKNOWN → release-marshal (merges only CLEAN) merged 0 → FACTORY MERGE FLOW STALLED (regression I introduced).
RESOLUTION: Mergify → REPORT-ONLY (commit ba3f8572): dropped queue+queue_rules, kept only conflict-labeling. Mergify still does CI Insights + Test Insights (what Ali wanted). MERGING = release-marshal (serial oldest-first admin-merge of green PRs = the merge-queue behavior Ali wanted). Also hardened release-marshal to merge UNSTABLE-with-real-gates-green (bak saved). VERIFIED: #189 merged (d5562150) after the change. Other 4 cycle PRs (#190/191/194/196) clear on next release-marshal cron (*/10) once Mergify re-eval drops stale queue checks.
NET: Ali keeps merge-queue BEHAVIOR (release-marshal) + Mergify INSIGHTS. Mergify's native speculative queue = revisit later w/ dashboard if he wants its specific features (needs CI-on-draft tuning). BE TRANSPARENT w/ Ali (he was excited re Mergify-the-queue specifically).

## COHERENT MERGE PIPELINE + SELF-HEALING (~16:18)
Root cause of stuck backlog: after #189 merged, other PRs went CONFLICTING; merge-train (conflict-resolver) was one-shot & dead → nothing rebased them → release-marshal (CLEAN-only) merged 0. FIXED with coherent standing design:
- PRODUCE: COO cron 15 */2 (re-enabled)
- RESOLVE CONFLICTS: merge-train relaunched (codex, log codex-merge-train-v4.log) + WATCHDOG now auto-relaunches it (new block 4b: if CONFLICTING academy PRs>0 & no merge-train & >30min since last → relaunch; stamp .merge-train-resume-stamp). bak saved.
- MERGE: release-marshal cron */10 (hardened to merge UNSTABLE-with-real-gates-green too)
- INSIGHTS: Mergify report-only (ba3f8572)
This is self-healing: conflicts get resolved automatically, no manual relaunch. Watchdog syntax OK, hook verified present (2 matches).
NEXT TICK: verify merge-train drained the ~10 CONFLICTING PRs → release-marshal merged them → open count drops. If merge-train idles with conflicts remaining, check its log.

## TICK ~16:47 — pipeline HEALTHY, draining CI backlog (NOT broken)
- Merge-train SUCCESS: rebased conflicting PRs → 10 CONFLICTING became 12 MERGEABLE, 1 CONFLICTING.
- "runner 0 procs" was a GREP ESCAPING FALSE ALARM — vds-runner is ONLINE busy=true (Runner.Listener 221285 + Worker active). Runner FINE.
- Real state: SINGLE serial runner working a CI BACKLOG — ~10 rebases each re-triggered ci+bot-review-watcher (~20 queued jobs). PRs go green one-at-a-time; release-marshal merges as each clears. merged=0 right now = pending CI, expected latency.
- release-marshal correctly WAITS on pending CI (#206/#205 ci=pending, #204 bot=pending). Will merge when green.
- OPTIMIZATION (not urgent): single runner = throughput bottleneck for whole factory; a 2nd runner would parallelize CI. Backlog drains between COO cycles (2h) for now.
- Pipeline coherent + self-healing: COO produce / merge-train+watchdog resolve / release-marshal merge / Mergify insights / serial runner CI.

## SELF-HEALING VERIFIED WORKING (~17:13) — factory autonomous + stable
- Watchdog merge-train auto-relaunch FIRED (20:30Z relaunched-merge-train(conflicting=1)) ✓ self-healing confirmed.
- Cycle PRs landing: #190/#194/#196 merged (by oscalar = merge-train/release-marshal admin). Open 13→10, CI backlog draining on serial runner.
- COO double-dispatch NOT happening: watchdog item-1 COO relaunch is `false`-guarded (log line "relaunched-via-coo-dispatcher" is cosmetic only); real production via COO cron 15 */2.
- merge-train merges green PRs despite "do not merge" instruction — harmless (green PRs only; release-marshal would do same; no fight).
- STATE: factory fully autonomous — produce(COO cron)/resolve(merge-train+watchdog self-heal)/merge(release-marshal+merge-train)/insights(Mergify report-only)/CI(serial runner). Lightening monitor cadence; system stable.
- Open follow-ups (non-urgent): docker lane cutover (writable-wt + 1 e2e test), platform #602 (bot-thread), 2nd runner for CI throughput, durable conflict-resolver already done via watchdog.

## TICK ~18:16/19:17 — healthy, disk managed
- Factory steady: ~4 merges/hr, open PRs ~4, main #210, done=137, runner idle (CI caught up), self-heal firing. 18:15 COO cycle produced 6 lanes cleanly.
- DISK: climbed 72%→81% (one-time: Docker images 5.5G + build cache + pkm-staging + CI _work). Reclaimed pkm-staging (833M, regenerable) + build cache (2.8G) → 80%, 49G free. Steady-state climb now slow (one-time builds done). Backstops: janitor 4am + watchdog 85% alert.
- Old stale ops artifacts left untouched (not mine, low confidence): /srv/aina/ops/paperclip-throughput-2026-06-19 (1.4G) etc — candidate for Ali/archival later.

## MAJOR DIRECTION CHANGE (Ali, ~evening): GITHUB-PER-TASK WAS DRIFT — go internal/hybrid
Ali's correction: system was DESIGNED as hybrid — heavy work fast+contained locally, GitHub main only after MILESTONES/EPICS (main=deploy source), NOT per-task. Per-task PR/CI/review/merge drifted in over ~10 days (proof-rails+canon-guards+pr-bot-watcher #123). I spent today DEEPENING the wrong model (Mergify/merge-train/release-marshal) instead of questioning it.
DIRECTIVES: (1) release mgmt = FRODO agent (88b49386 devops, idle) not my cron scripts — "factory does it itself"; Gimli(2fe6579c qa)=QA. (2) promote after milestones/epics not every task. (3) DEV TEAM ONLY (=Journey/Curriculum/Data/QA goals), NOT marketing. (4) remove GitHub part NOW. (5) keep internal; share Cloudflare PREVIEW URLs (wrangler) until work done. (6) NO more watchers and bots.
DONE (teardown): killed merge-train; PAUSED crons release-marshal+watchdog+coo-ops; DISABLED bots pr-bot-review-watcher+rollout-health-monitor (disabled_manually); hermes now systemd Restart=always (no watchdog needed). Production PAUSED. 12 in-flight lanes from 18:15 cycle finishing on OLD model (will push ~final PRs).
MODEL TO BUILD (Frodo internal): dev-team agents build in local worktrees→commit local `dev` branch (no push/PR/CI per-task); Frodo builds dev + wrangler Cloudflare PREVIEW → shares URL w/ Ali; at milestone Frodo integrates + Gimli QA verifies → on Ali's go promote dev→GitHub main→prod deploy. NO cron scripts — Frodo the agent runs it.
BLOCKED ON ALI: point me to the release-flow TEST RUNS (he said "ran a couple") to restore his exact design. Dev-team scope RESOLVED (Journey/Curriculum/Data/QA).
DO NOT rebuild on GitHub-per-task. Retire (don't just pause) release-marshal/merge-train/Mergify-queue once Frodo flow proven.

## PARALLEL VERIFIERS DISPATCHED (~restore Gimli model) — clearing in_review backlog
Forensics: in_review=70, but 68 HAVE branches/PRs (built by COO lanes; Paperclip didn't track → executionState=None red herring). = built-but-never-verified orphans (Gimli verify step was bypassed by GitHub-PR drift). Ali: "add 2+ Gimlis, clear the backlog."
DONE: launched 3 parallel verifier codex lanes (Gimli-1/2/3, gpt-5.5) — each ~23 of the 68. Prompt=adversarial review of each task's PR diff vs acceptance → paperclipai issue update --status done (verified) or todo (rejected w/ gap). INTERNAL (no GitHub merge). Logs /srv/aina/ops/verifiers/gimli-{1,2,3}.log. Map /srv/aina/ops/ir-map.json (identifier->internal-id). Prompt/batches in scratch/pkm-deepdive/mergify/.
NEXT: monitor verifier completion (N done/M rejected); then wire STANDING restored model — persistent verifier agent records (clone Gimli, config-get was finicky), Gimli-verify inner loop, Frodo milestone-release + wrangler preview URLs, dev-team only (Journey/Curriculum/Data/QA). Retire drift machinery permanently.
Forensics deliverable: scratch/pkm-deepdive/aina-factory-task-history-forensics-2026-06-30.{md,html} (sent to Ali).

## VERIFIERS DONE — BACKLOG CLEARED (~restore proven)
3 parallel Gimlis finished. in_review 70→4. done 137→182 (~45 verified-pass). todo 48→68 (~20 rejected-back-for-rework w/ specific gaps — genuinely adversarial, not rubber-stamp). All internal, no GitHub merges/edits. Gimli-1: 18 done. Gimli-3: 12 done (AIN-162/244/230/239/236/228...). PROVES: designed build→Gimli-verify→done model works parallelized; flood ceiling solved by N verifiers.
STILL PENDING (restored standing model): persistent verifier agent records (config-get finicky); Gimli-verify as standing inner-loop gate; Frodo milestone-release + wrangler preview URLs (dev-team only Journey/Curriculum/Data/QA); metered intake; retire drift machinery permanently. The ~20 rejected + 48 never-built todo need building under the restored (metered, Gimli-gated) model — NOT the old GitHub-per-task flow (COO still paused).

## BRAINSTORM: native self-driving Paperclip (IN PROGRESS, awaiting Ali design approval)
BIG REFRAME (research): the self-driving design ALREADY EXISTS but is switched OFF. Evidence:
- 7 native Paperclip ROUTINES defined (native scheduler, no cron) — ALL PAUSED "until lanes live/bridge proven". Roles: Atlas(Roadmap Steward=dispatch), Eowyn(QA/E2E gate), Donna(CoS digest/cost), Finch(PKM memory), + Gimli(adversarial verify), Frodo(release), Jessica(CEO).
- agent-context-map (aina-paperclip-agent-context repo): 62 agents→9 lanes→role→runtime-class→DEFAULT VERIFIER (producer/verifier separation built in). 8 lead-operators=dept heads. content-curriculum lane APPLIED+verified; rest mapped-not-applied.
- Native primitives replace ALL my external scaffolding: routines(=COO cron), agent wake/heartbeat(=on-demand), issue interactions+comments(=inter-agent invoke/mention), child:create+tree(=lead decomposition), recovery-actions(=watchdog), approvals(=founder gates), org-chart(=leads).
- ONE real problem: dispatch routine uses Hermes↔Paperclip BRIDGE (Ali wants removed) → replace w/ native assignment/wake.
ALI'S ENGINE MODEL (confirmed): HYBRID + team-head-driven. Goal→dept head assigns within team (multiple agents as task needs)→team builds→head verifies intra-dept→forward to QA(Gimli adversarial+Eowyn visual+Calibrator 2nd-verifier)→Release(Frodo at milestone→preview URL)→CEO Jessica coordinates heads+arbitrates+approves. Always-on=Jessica+8 heads+1 minimal keeper(Atlas). Members wake on assignment.
DEPT HEADS (from map): data=Laurie, platform=Richard(CTO), agentops=Jared, exec=Jessica, security=Benjamin, growth/marketing=Erlich+Harvey(PARKED), release=Frodo. TBD: content-curriculum head(Monica?), qa-lead(Root/Gimli?).
DESIGN PRESENTED to Ali (awaiting approval): activate existing design lane-by-lane native-only (content-curriculum first=already live), dev-lanes only (content/data/platform/qa), marketing+legal parked. Keeper=Atlas (Jessica stays pure oversight). REMOVE: Hermes bridge, COO, watchdog, release-marshal, merge-train, Mergify, GitHub per-task PR+bots. Git internal; Frodo promotes at milestone only.
NEXT (after Ali approves): write design spec (docs/superpowers/specs/), self-review, Ali reviews, then ce-plan→implement. HARD-GATE: no implementation until approved. Brainstorm skill active; TaskList #3 in_progress.

## DESIGN SPEC DELIVERED — awaiting Ali review (brainstorm user-review gate)
Spec: scratch/pkm-deepdive/aina-factory-native-selfdriving-design-2026-06-30.{md,html} (sent to Ali). Reconciliation VERIFIED (read 190 human turns myself + reader agent + botfix 3→61 timestamps): root cause = 06-30 per-task-GitHub machinery (mine), NOT 06-29 ANMS volume. (b) confirmed. Mergify RELOCATED into Frodo's release/GitOps team, milestone-scoped (not retired, not per-task). Design = restore native ANMS-spec-driven factory: CEO Jessica→heads→members(wake on assign)→QA(Gimli/Eowyn/Calibrator)→Frodo release/GitOps at milestone→preview URL. Native routines/wake/handoff/recovery; remove Hermes bridge+all external scaffolding; dev-lanes-first (content live); marketing/media git-free; Docker parked.
PLAN PREP: content-curriculum head=Monica. qa-release head TBD (Root/Gimli/Frodo). Routine trigger structure = pull in plan. Quiescent: codex 0, crons paused.
NEXT (after Ali approves spec): invoke writing-plans/ce-plan → implementation plan (routine re-points, lane activation sequence, head confirms) → wire lane-by-lane. HARD-GATE: no wiring until spec approved. TaskList #4 in_progress.

## VERIFICATION COMPLETE (positive) + PLAN LANDED — wiring is next (fresh focus)
VERIFIED (Ali's pre-proceed ask): (1) Workspace/git mechanic — Paperclip runs each issue in an isolated workspace DERIVED FROM ITS PROJECT (git). 216/263 project-bound; the 47 project-less are ALL done/cancelled (zero active) → active/future work is git-backed → NO home-drift. (2) Docker PARKED/non-interfering (0 containers, not in exec path; only a cloudflare-plugin doc mentions docker) — keep parked, don't remove. (3) isolated-workspaces stays ON (per-issue isolation correct; fix was project-binding not the toggle). (4) qa-release head = FRODO (release-carrier). content head = Monica.
PLAN: PKM-monorepo/docs/plans/2026-06-30-002-feat-native-selfdriving-paperclip-factory-plan.md (committed 37e926a0e, pushed). 8 units. KTD 7 = workspace mechanic. U8 = lane-default project inheritance for NEW issues (light — no backfill; active work already bound). U4 (remove watchdog/scaffolding) gated on U8. Design origin locked + archived (aina-factory-archive-2026-06-30/, commit 4969bcda0).
WIRING STATUS: U1 reconcile DONE (findings above). U2 native-wake test ATTEMPTED but blocked on CLI syntax — `paperclipai issue update <issue-UUID> --assignee-agent-id <FULL-agent-UUID> --comment ...` (NO -C/--company-id on update; issue UUID is global; agent id must be FULL uuid not 8-char prefix — 3bdfbfc6 → get full via `agent list --json`). Nothing landed (all attempts errored, no state change). Factory quiescent (crons paused, 0 lanes).
NEXT SESSION (fresh focus) — WIRING: U2 assign a content todo to Curriculum Architect (full UUID) via Monica → confirm native assignment WAKES the agent (idle→running) with no script = THE proof point. If wake fires → U3 re-point 7 routines off Hermes + always-on(Jessica+heads+Atlas) → U5 QA(Gimli/Eowyn/Calibrator) → U6 Frodo release/GitOps milestone runbook → U8 project-default → U4 REMOVE scaffolding LAST (only after native proven) → U7 replicate lanes. content-curriculum first (already applied). Content agents idle+ready: Monica 379acc14, Curriculum Architect 3bdfbfc6, Assessment 958c6092, Learner-Exp d951d059. 18 ready content todos (skip FOUNDER-DECISION ones).

---

## U2 RESULT — native self-drive PROVEN (2026-06-30 ~23:00 EDT / 2026-07-01 ~03:00 UTC)

**Verdict: core thesis PASS. Two findings that validate + deepen the plan's sequencing.**

### What was proven (live, on real hardware, gpt-5.5/codex_local, zero Claude tokens)
Assigned AIN-95 (`1a9e138e…`, "L1–L5 lesson+rubric+evaluator content coverage") to **Curriculum Architect** (`3bdfbfc6`). With **no manual trigger**, the server spun up a run `invocationSource=assignment` → the full native flow executed autonomously:
1. **CA woke on assignment** (heartbeat.enabled=false only disables the *timer*; assignment-wake is consumed natively by the server) → authored content in `/srv/repos/aina-academy` → ran 5 check scripts (coverage/rubric-schema/prose-qa/evaluator-smoke) all pass.
2. CA was **honest**: refused self-verify ("intended disposition is in_review with Calibrator, not self-review"), flagged a transient "Paperclip API unreachable" mid-run.
3. A native **automation** run (recovery-style) nudged: "Paperclip needs a disposition before this issue can continue."
4. CA **natively handed off** → "Assigning Calibrator for evaluator review handoff."
5. **Calibrator** (`b74f5dd3`) **auto-woke on the handoff** → added durable receipt `calibrator-review.md`, re-ran all 5 checks, gave a **properly-bounded done**: "done for AIN-95 content coverage… does NOT claim live LLM calibration, production threshold approval, or founder-gated threshold changes."
6. Issue → **done**. **Producer≠verifier held** (CA built, Calibrator verified — different agents). 3 runs, all `succeeded`.

Test scenarios: ✅ happy-path (zero operator action after assign) · ✅ assignment-wake (no script) · ✅ producer≠verifier · ⚠️ @mention-wake (not explicitly tested; handoff was via assignment) · ⚠️ no-assignee negative (not explicitly tested; consistent w/ observed quiescence).
NOTE: the stale "genuinely pending" comment is dated **2026-06-29** (prior-session triage), NOT this run.

### Finding F1 — DURABILITY (→ U6 is a hard prerequisite, U4 stays last)
This run's output was **uncommitted** in the working tree. Under option (b) that's *correct* (no per-task git; Frodo commits at milestone) — but U6 (Frodo milestone runbook) is the ONLY durable path, so it must exist before U4 removes the watchdog/safety nets. Preserved now in `aina-academy` **stash@{0}** (labeled) so nothing is orphaned pending U6.

### Finding F2 — WORKSPACE ISOLATION (elevates U8; sequence before broad autonomy/U7)
Agents share **ONE mutable checkout** of the real target repo `/srv/repos/aina-academy` and wrote to **whatever branch was checked out** — here `ali/ain-91-media-engine-lesson-slots` (a *different* issue's branch). So:
- Concurrent agents on the same repo **WILL collide + cross-contaminate branches** — this is the deep root of the "dirty branches/worktrees" chaos.
- The Paperclip per-issue workspace isolation applies to the project **managedFolder**, but agents actually `cd` to the shared real repo. Isolation is NOT solved for the real target repos.
- **Action:** extend U8 from "new issues inherit a project" → **"each dev issue/lane gets an isolated git worktree/branch in the real target repo"** (or Frodo assembles from per-issue managed folders). Do this BEFORE U3's full always-on / U7 multi-lane, else autonomy amplifies collisions.

### State after U2
- AIN-95 = done (assignee Calibrator). Output preserved in aina-academy stash@{0}.
- Factory still quiescent (no scheduler looping; all heartbeat.enabled=false). One redundant manual heartbeat `8b31cb90` was queued behind the auto-run (harmless).
- Hermes gateway still running (U4 target, untouched).
- **Next: U3** (re-point routines off Hermes + always-on for Jessica+heads+Atlas) — but fold in F2 (isolation) before broad autonomy.

## F2 isolation — root cause refined + native fix (2026-06-30 ~23:20 EDT)
`git worktree list` on /srv/repos/aina-academy shows isolation WAS happening via the EXTERNAL COO: worktrees `~/agent-workspaces/coo-*-20260630-*` on `ali/coo-*` branches. Pausing the COO stopped per-task worktree creation → U2 agent fell back to the shared main checkout (AIN-91 branch). So the fix = configure Paperclip's NATIVE `git_worktree` policy to replace the COO's external worktree creation.
- Native enums (from paperclipai source): defaultMode ∈ {shared_workspace, isolated_workspace, operator_branch, adapter_default}; workspaceStrategy.type ∈ {project_primary, git_worktree, adapter_managed, cloud_sandbox}; + worktreeParentDir. Current dev projects = shared_workspace/project_primary (the collision).
- Fix: set dev-lane projects → {defaultMode:isolated_workspace, workspaceStrategy:{type:git_worktree}, worktreeParentDir:…}. Bind curriculum projects' codebase to the real academy repo.
- VDS: 232G disk, 45G free (81% used — watch). aina-academy .git=319M, **347 branches** (COO per-task sprawl), many stale worktrees + 255 execution-workspace rows (metadata; managed dir only 72K on disk). Cleanup of coo-* worktrees/branches = part of U4 (removing COO scaffolding).
- Codex second-opinion dispatched on the policy JSON + apply order (infra = Codex-gated). Apply to content-curriculum first, re-run U2, confirm agent lands in an isolated worktree, then replicate. Frodo (U6) merges per-issue worktree branches at milestone.

## F2 isolation — Codex GO-WITH-CHANGES + applied to Lane 1 (2026-06-30 ~23:30 EDT)
Codex verdict GO-WITH-CHANGES on git_worktree isolation. Refinements folded in: allowIssueOverride:false; worktreeParentDir per-repo; branch names unique; **policy fixes DEFAULT cwd not shell-nav → need preflight assert cwd≠canonical checkout**; **fresh worktrees lack gitignored assets (node_modules 1.3G, .env) → provisioning required**; freeze-don't-delete old workspaces, GC later gated on >dirty-check.

**EXACT SCHEMA (from paperclipai source, `.strict()`):** executionWorkspaceStrategySchema = { type:enum, baseRef, branchTemplate, **worktreeParentDir**, **provisionCommand**, **teardownCommand** }. worktreeParentDir goes INSIDE workspaceStrategy (NOT top-level — my first attempt put it top-level → strict-reject → policy set null). Top-level policy fields: enabled, defaultMode, allowIssueOverride, defaultProjectWorkspaceId, workspaceStrategy. **provisionCommand/teardownCommand ARE the native provisioning hook** (solves Codex point 9 — no workspace setupCommand hack needed): set provisionCommand to `ln -sfn /srv/repos/aina-academy/node_modules node_modules; [ -e …/.env ] && ln -sfn …/.env .env; true`.

**APPLIED to Lane 1 (a64c0faf, already academy-bound):** `{enabled:true, defaultMode:isolated_workspace, allowIssueOverride:false, defaultProjectWorkspaceId:6d7912e4…, workspaceStrategy:{type:git_worktree, worktreeParentDir:/home/ali/paperclip-worktrees/aina-academy}}` — PERSISTED. worktree parent dir = /home/ali/paperclip-worktrees/aina-academy (/srv/worktrees needs root, no sudo). ROLLBACK snapshot: scratch/pkm-deepdive/ROLLBACK-lane1-project.json (policy was None).
Test: created AIN-265 (marker task, no deps) in Lane 1, assigned to Curriculum Architect → verifying it lands in an isolated worktree not the shared checkout. provisionCommand + preflight to add AFTER bare isolation proven.

## PRESERVATION SWEEP — nothing-lost audit + fixes (2026-06-30 ~23:55 EDT)
Ali asked to ensure nothing discussed/shared/decided is lost. Audited 3 buckets; found + fixed real at-risk work.

**Bucket 1 — strategy/design/decisions: SAFE (already on GitHub).** Native-redesign design + plan + full archive are pushed to `origin/ali/m3-context-graph-2026-06-29` in oscalar/pkm-monorepo (commits 3a78c61dd, 4969bcda0 confirmed on remote). "No upstream" was just missing tracking config, not missing data.

**Bucket 2 — agent product work: PRESERVED.**
- aina-academy: 3 dirty worktrees WIP-committed (evaluator scoring-spine/tutor code, 549 lines); 2 stashes tagged (vds-preserve-stash-0/1-20260630, incl. AIN-95); ALL local branches mirrored → `origin/vds-preserve-20260630/*`. On GitHub. ✅
- aina-platform: all local branches mirrored → `origin/vds-preserve-20260630/*` + stash tagged. On GitHub. ✅ (incl. academy-ui-demo-integration branch)
- aina-data-engine-room: CANNOT push to GitHub (hard 100MB file limit — parquet embeddings 95MB, jsonl corpora 108/153MB). BUT preserved off-GitHub: today's full bundle `/home/ali/.cache/aina-repo-bundles/srv-aina-aina-data-engine-room.bundle` (413MB, Jun30 14:01, contains current main 66a07c97 incl. +342 diverged commits) + R2 restic offsite ran today 13:30 ("13 paths to R2 DONE"). DER's 11 uncommitted = tooling junk (.claude/.codex/node_modules), not product work. ✅

**Bucket 3 — UI/design materials Ali SHARED: interim-safe, needs permanent home.** "AI Native Academy Design System" (83MB/193 files, has SKILL.md+ui_kits+surfaces+prototype — an operational design skill, NOT git-backed) + ui-lesson-previews (684K). Tarred + copied to VDS `/home/ali/shared-materials-preserve/` (Mac+VDS = 2 boxes now) w/ sha256. NOT yet in a git repo or confirmed in R2 path. **DECISION NEEDED from Ali: permanent home** — dedicated aina-design GitHub repo (recommended; it's standalone+has SKILL.md) vs into aina-academy vs PKM archive. Tarballs at scratch/pkm-deepdive/shared-materials-preserve/ + VDS.

**Cleanup safety note:** the `vds-preserve-20260630/*` backup namespace + stash tags mean any later branch/worktree GC (U4) is now safe — everything is snapshotted on GitHub first.

## U8 ISOLATION — first attempt NEGATIVE (2026-07-01 ~00:00 EDT) — needs agent-level fix
Applied git_worktree policy to Lane 1 (a64c0faf) correctly (persisted). Tested with marker task AIN-265 → Curriculum Architect. RESULT: policy did NOT isolate. Agent reported `pwd:/srv/repos/aina-academy` branch `ali/ain-91-…`; worktreeParentDir empty; WORKTREE-PROOF.md leaked into shared checkout. **Root: codex_local agents have a FIXED adapterConfig.cwd (=context repo) and the task made them `cd` into the shared product repo — the PROJECT-level execution-workspace policy doesn't intercept the adapter's fixed cwd.** Confirms Codex footgun #8 (policy fixes default cwd, not shell-nav). 
**Next for U8:** isolation must act at AGENT/ADAPTER level — per-issue worktree injected as the codex cwd (not a fixed adapterConfig.cwd) + a preflight guard that fails if cwd==canonical checkout + provisionCommand for node_modules symlink. Need to learn how Paperclip feeds an execution-workspace path into the codex_local adapter cwd (does it override adapterConfig.cwd when policy=git_worktree + agent cwd unset?). Lane 1 policy left in place (harmless, correct-but-insufficient). Test artifacts cleaned (marker removed, AIN-265 cancelled).

## TURN FACTORY ON — per-lane workspaces confirmed = Ali's model (2026-07-01 ~00:20 EDT)
Ali: per-issue isolation is over-engineering; per-TEAM/lane git workspaces is the wanted setup ("QA works in QA folder"). CONFIRMED: each team already has its own git workspace under aina-paperclip-agent-context/workspaces/<lane> (content-curriculum, platform-engineering, data-personalization, growth-media, security-privacy-compliance, qa-release, research-intelligence, executive-governance) — each a separate git repo. Cross-team collision already avoided. The only shared reach was into /srv/repos/aina-academy for product code = Frodo's release concern, and preservation is done so nothing's at risk. → NOT blocking factory-on on isolation.

**ON-SWITCH mechanism:** server runs `paperclipai run` with NO pause flag ("paused" is just the systemd Description label). Scheduler is active (scheduler-heartbeats table w/ next_run_at/last_fired_at). Nothing fires only because all agents have heartbeat.enabled=false. So ON = enable heartbeat on the always-on brains → server fires them on cadence → they survey+assign → members auto-wake (proven U2).

**Always-on set (DEV-LANES-FIRST):** Jessica(6454b8e0 ceo), Monica(379acc14 content), Richard(be6cc169 platform), Laurie(af273e31 data), Jared(a873590c agentops), Frodo(88b49386 release). DEFER growth-media (Harvey 85254289/Erlich 773887c3), security (Benjamin 51b3bd27/CCO 5fb74c15), research (Mike eb3f53b4) — parked lanes. No Atlas keeper exists; heads survey own lanes.

**Verify-before-flip:** enabled Monica heartbeat FIRST (payload {"runtimeConfig":{"heartbeat":{"enabled":true,"maxConcurrentRuns":1},"modelFallbacks":[]}}, partial-merge works, persisted). Monitoring whether server fires her autonomously (lastHeartbeatAt advance / timer-sourced run). If yes → enable the other 5 heads + set cadence. ROLLBACK-monica.json saved. agent update = `agent update <id> --payload-json <json>` (no -C).

## FACTORY MECHANISM FULLY PROVEN — head-routing works (2026-07-01 ~01:05 EDT)
CRITICAL FINDING: Paperclip (this build) has NO internal scheduler (no startScheduler/schedulerLoop/pollDueHeartbeats in source; no routine/trigger/schedule/cron CLI cmd; Monica heartbeat.enabled=true did NOT fire in 12min). It's event-driven (assignment-wakes work, U2) but periodic ticking must come from OUTSIDE (what the COO did). → "always-on via heartbeat.enabled" is NOT how it works.

BUT the full loop is PROVEN via manual head poke: fired `heartbeat run -a Monica --source timer` → Monica (content head) surveyed her team-goal, routed 4 real issues (AIN-94,183→Curriculum Architect; AIN-124,170→Learner-Exp), correctly SKIPPED the FOUNDER-DECISION issue, used real backlog not invented work. Assignments AUTO-WOKE members → 3 agents now RUNNING (Monica + Curriculum Architect on AIN-94 + Learner-Exp on AIN-124). **poke head → head surveys+assigns → members auto-wake → build. Factory alive.**

Minor: Monica noted direct comment/wake restricted by actor-boundary auth after assignment, but assignment-run creation worked (the important path).

**"ON" = minimal doorbell (the irreducible external piece):** a ~10-line timer that every ~15min fires `heartbeat run`/`agent wake --source timer` on the dev-lane heads (Monica/content, Richard/platform, Laurie/data, Jared/agentops; +Jessica coordinate; Frodo only at milestone). NO routing logic (heads route natively) — a doorbell, not the COO. heartbeat.enabled flag irrelevant to external timer (can revert Monica's). Timer = STANDING CONFIG → get Ali's explicit OK before installing. Content lane already kickstarted + running.
**PENDING Ali decision:** (a) install persistent doorbell timer? (b) scope — all dev lanes now vs content-only observe first.

## FACTORY TURNED ON — doorbell installed (2026-07-01 ~01:10 EDT) — Ali chose "Full go, all dev lanes"
Built `/srv/aina/ops/factory-keeper.sh` — minimal doorbell: wakes each IDLE dev-lane head (Jessica 6454b8e0, Monica 379acc14, Richard be6cc169, Laurie af273e31, Jared a873590c) via `heartbeat run --source timer`; idle-guard prevents pileup; NO routing logic (heads route natively). Replaces the removed COO loop. Cron: `*/15 * * * * /srv/aina/ops/factory-keeper.sh` INSTALLED. Log: /srv/aina/ops/factory-keeper.log.
GOTCHA fixed: first write via nested heredoc mangled the python quotes (`a.get(status,)` → empty status → skipped all). Rewrote via base64 + jq status parse. Kickstart run confirmed pokes fired (Jessica/Richard/Laurie/Jared poked, Monica skipped=running).
State: content lane already live (Monica+Curriculum Architect+Learner-Exp running on AIN-94/124). Other 3 dev lanes kickstarted. Verifying multi-lane spin-up.
DEFERRED heads (parked lanes, not in doorbell): growth-media (Harvey/Erlich), security (Benjamin/CCO), research (Mike). Frodo (release) NOT in doorbell — milestone-only (U6 runbook still TODO). Monica's heartbeat.enabled=true left on (harmless; ROLLBACK-monica.json exists).
REMAINING: U6 Frodo milestone release runbook; U4 remove Hermes bridge (still running) + old scaffolding (safe now — everything preserved to GitHub vds-preserve namespace); per-lane product-repo worktrees if within-lane collisions appear (deferred per Ali — per-team folders suffice).