Implementation plan

Run Anvil code changes in Cloudflare Containers by default.

The spike should prove a complete Slack-thread-to-code-runner loop: thread-scoped Durable Object, Forge worktree, Codex execution, tests, browser screenshots, artifact capture, and Slack/GitHub updates, with Superset preserved as an explicit fallback tool.

Target Architecture

The Think harness becomes a gateway. It reads the Slack thread, picks the runner policy, creates or resumes the thread Durable Object, and then gives the container only scoped callbacks and short-lived repo credentials.

Ingress: Slack mention

Fetch parent and replies before planning; use mention timestamp as idempotency key.

Gateway: Think harness

Classify request, choose Cloudflare or Superset, enforce actor/repo policy.

Session: Thread Durable Object

Owns event journal, run state, WebSocket/status stream, container lifecycle.

Worktree: Artifacts repo fork

One isolated Git-compatible repo per task; fallback to GitHub clone plus R2 archive.

Execution: Container runner

Node 22, pnpm, Codex, Git, Forge deps, Playwright, screenshot capture.

Outputs: Slack, PR, artifacts

Worker-mediated writes, signed artifact links, screenshots, traces, final evidence.

Decisions For The Spike

These are the decisions the spike should validate, not leave as open-ended architecture questions.

Use one Durable Object per Slack thread.

The DO identity is derived deterministically from slack:{team_id}:{channel_id}:{thread_ts}. A new mention in the same thread resumes the same DO, which can start a new run if the prior run is complete.

Snapshot main after successful dev deploys.

Add an async GitHub Actions step that runs after the dev deploy succeeds. It updates the canonical Forge base in Artifacts when available, and writes a fallback manifest plus tar/cache artifacts to R2.

Make browser validation first-class.

The initial image includes Playwright/Chromium dependencies. The runner exposes browser_test and capture_screenshot tools that publish screenshots, traces, and logs as artifacts.

Initial Container Image

Start from the existing Forge runtime pattern instead of a generic Ubuntu image: DB-free runtime, Worker-owned secrets, and a narrow callback channel. The image should be stable enough to avoid installing browsers and package managers during each task.

Base and runtimes

Use the Cloudflare sandbox/container base or a compatible Linux base with Cloudflare container support. Install Node 22.x, pnpm 10.28.1, Git, OpenSSH client, curl, jq, ripgrep, Python 3, build-essential tooling, and CA certificates.

Agent and repo tooling

Bake in the Anvil code-runner entrypoint, Codex runtime adapter, MCP client shims, GitHub CLI only for nonprivileged inspection, and the same workspace/archive helpers used by containers/agent-runtime-server.

Browser dependencies

Install Playwright and Chromium during image build with a fixed browser path. Do not download browser binaries during task execution. Store traces, videos, screenshots, console logs, and network logs under /artifacts/browser.

What not to bake in

Do not copy the Forge repo into the image as the source of truth, and do not bake in Slack, GitHub App, model, database, or Superset credentials. The repo arrives per task and secrets stay Worker-side or short-lived.

Sketch Dockerfile

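FROM docker.io/cloudflare/sandbox:0.12.1

# Assumption: the base image already provides Node 22.x; if it does not,
# install it before the Playwright step below.
# Note: pnpm skips devDependencies when NODE_ENV=production, so task installs
# that need test tooling may have to override it.
ENV NODE_ENV=production \
    PLAYWRIGHT_BROWSERS_PATH=/ms-playwright \
    PNPM_HOME=/usr/local/bin

# OS tooling for the runner, Git operations, and native builds.
RUN apt-get update && apt-get install -y --no-install-recommends \
    bash ca-certificates curl git openssh-client jq unzip xz-utils \
    ripgrep fd-find python3 make g++ pkg-config \
  && rm -rf /var/lib/apt/lists/*

# Pinned pnpm standalone binary (x64 only).
RUN curl -fsSL \
    https://github.com/pnpm/pnpm/releases/download/v10.28.1/pnpm-linux-x64 \
    -o /usr/local/bin/pnpm \
  && chmod +x /usr/local/bin/pnpm

# Working directories for task repos, evidence, caches, and Codex config.
RUN mkdir -p /worktrees /artifacts/browser /cache/pnpm /root/.codex

# Install browser dependencies at build time, not per task.
# Pin the Playwright version to the one Forge uses to keep this reproducible.
RUN pnpm dlx playwright install --with-deps chromium \
  && rm -rf /var/lib/apt/lists/*

# Copy only the runner server/adapter, not the Forge worktree.
COPY containers/anvil-code-runner/dist/ /runner/
COPY containers/agent-runtime-server/dist/ /runner/agent-runtime-server/

WORKDIR /worktrees
EXPOSE 8080
ENTRYPOINT ["/runner/anvil-code-runner"]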

The exact entrypoint can change, but the constraints should not: reproducible browser install, no source repo in the image, and no privileged long-lived secrets in the container.

Repo Loading And Snapshot Plan

The long-term fast path is Artifacts. The spike should still work before Artifacts access by cloning GitHub and archiving state to R2.

Base refresh

After every successful dev deploy, refresh the canonical base.

Add a nonblocking GitHub Actions job that runs after the dev Cloudflare deploy succeeds. It records the deployed commit SHA, image digest, pnpm lock hash, generated client/schema hash, and timestamp in a signed manifest. If Artifacts is available, it imports or fast-forwards anvil/forge-main; otherwise it writes the manifest and warm tar/cache payloads to R2.
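The refresh job's decision can be sketched as a small manifest type plus a target selector. This is a minimal sketch, not the actual job: the field names, the `forge-base/...` R2 key layout, and the sample values are assumptions, and signing is left out because the plan does not name a scheme.

```typescript
// Hypothetical manifest shape; field names are assumptions based on the list above.
interface BaseManifest {
  commitSha: string;
  imageDigest: string;
  pnpmLockHash: string;
  generatedClientHash: string;
  refreshedAt: string; // ISO 8601 timestamp
}

type RefreshTarget =
  | { kind: "artifacts"; repo: string }
  | { kind: "r2"; manifestKey: string };

// Choose where the refreshed base goes: Artifacts when available, otherwise R2.
function planBaseRefresh(manifest: BaseManifest, artifactsAvailable: boolean): RefreshTarget {
  if (artifactsAvailable) {
    return { kind: "artifacts", repo: "anvil/forge-main" };
  }
  // Assumed key layout: one manifest per deployed commit.
  return { kind: "r2", manifestKey: `forge-base/${manifest.commitSha}/manifest.json` };
}

// Made-up sample values, for illustration only.
const sampleManifest: BaseManifest = {
  commitSha: "abc123",
  imageDigest: "sha256:example",
  pnpmLockHash: "lock-example",
  generatedClientHash: "client-example",
  refreshedAt: "2025-01-01T00:00:00Z",
};

console.log(planBaseRefresh(sampleManifest, false));
```

Keeping the target choice in one function means the later switch from R2 to Artifacts changes a flag rather than the job's structure.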

Task fork

Create an isolated task repo/worktree.

The thread DO asks the gateway for a repo credential scoped to the task. Preferred path: fork/copy the Artifacts base repo into anvil/task-{thread_hash}-{mention_ts} and give the container a repo-scoped read/write token. Fallback path: shallow or blobless clone from GitHub using a short-lived GitHub App token.
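The preferred and fallback paths can be expressed as one planning function. A minimal sketch under stated assumptions: `planTaskRepo`, the `RepoPlan` shape, and the choice of a blobless clone (`--filter=blob:none`) over a shallow one are illustrative, and `threadHash` is taken as an input because the plan does not specify how it is computed.

```typescript
type RepoPlan =
  | { kind: "artifacts-fork"; baseRepo: string; taskRepo: string; access: "read-write" }
  | { kind: "github-clone"; cloneArgs: string[]; credential: "short-lived-github-app-token" };

// Decide how the container receives its task worktree.
function planTaskRepo(threadHash: string, mentionTs: string, artifactsAvailable: boolean): RepoPlan {
  if (artifactsAvailable) {
    return {
      kind: "artifacts-fork",
      baseRepo: "anvil/forge-main",
      // Naming follows anvil/task-{thread_hash}-{mention_ts} from the plan.
      taskRepo: `anvil/task-${threadHash}-${mentionTs}`,
      access: "read-write",
    };
  }
  return {
    kind: "github-clone",
    // Blobless clone; a shallow clone (--depth=1) is the other option the plan allows.
    cloneArgs: ["clone", "--filter=blob:none"],
    credential: "short-lived-github-app-token",
  };
}

console.log(planTaskRepo("9f2c", "1712345678.000200", true));
```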

Warm deps

Separate source freshness from dependency warmup.

The image owns OS/browser/tool dependencies. R2 owns reusable caches keyed by lockfile and platform. The task still runs pnpm install --frozen-lockfile, but it should hit pnpm store/cache rather than the public network whenever possible.
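A cache key built from the lockfile hash and platform might look like the following. This is a sketch only: the `cache/pnpm/...` key layout and the helper names are assumptions, not an existing R2 convention.

```typescript
interface CacheKeyParts {
  lockfileHash: string; // hash of pnpm-lock.yaml
  platform: string;     // e.g. "linux"
  arch: string;         // e.g. "x64"
}

// Build the R2 object key for a warm pnpm store payload (assumed layout).
function pnpmCacheKey(parts: CacheKeyParts): string {
  return `cache/pnpm/${parts.platform}-${parts.arch}/${parts.lockfileHash}.tar`;
}

// Report whether the task can start from a warm store or must go to the network.
function cacheState(key: string, availableKeys: string[]): "warm" | "cold" {
  return availableKeys.includes(key) ? "warm" : "cold";
}

const key = pnpmCacheKey({ lockfileHash: "lock-example", platform: "linux", arch: "x64" });
console.log(key, cacheState(key, [key]));
```

Because the key changes whenever the lockfile changes, a stale cache is never restored; the task simply falls back to a cold install.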

Archive

Persist every terminal run.

On success, failure, cancellation, or timeout, save the event journal, diff, workspace metadata, screenshots, traces, and verification logs. Artifacts is the code-state archive; R2 is still the binary/log archive.
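The archive rule can be sketched as two small helpers. The status and item names are assumptions, as is the classification of only the diff as code state; the plan says which store holds code state versus binaries and logs, but does not assign each item.

```typescript
type RunStatus = "queued" | "running" | "success" | "failure" | "cancelled" | "timeout";
type ArchiveItem =
  | "event-journal" | "diff" | "workspace-metadata"
  | "screenshots" | "traces" | "verification-logs";
type ArchiveStore = "artifacts" | "r2";

const TERMINAL_STATUSES: RunStatus[] = ["success", "failure", "cancelled", "timeout"];

// Every terminal status triggers an archive, not just success.
function isTerminal(status: RunStatus): boolean {
  return TERMINAL_STATUSES.includes(status);
}

// Code state goes to Artifacts when available; everything else goes to R2.
function archiveStoreFor(item: ArchiveItem, artifactsAvailable: boolean): ArchiveStore {
  const isCodeState = item === "diff"; // assumption: only the diff counts as code state
  return isCodeState && artifactsAvailable ? "artifacts" : "r2";
}

console.log(isTerminal("timeout"), archiveStoreFor("diff", true), archiveStoreFor("traces", true));
```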

Slack Thread To Durable Object Scope

The agent should not guess from partial context. Every Slack mention should fetch the current parent and replies first, then route through a deterministic thread DO.

Deterministic session key

Compute sessionKey = slack:{team_id}:{channel_id}:{thread_ts}. Use that as the Durable Object name. This removes the need for a database lookup just to decide whether a session exists.
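The key derivation is small enough to show in full. A minimal sketch: the `SlackMention` interface is a simplified, hypothetical view of the event, and falling back to the message's own `ts` for a top-level mention is an assumption about how a new thread should be rooted.

```typescript
// Simplified view of a Slack mention; only the fields used here.
interface SlackMention {
  teamId: string;
  channelId: string;
  ts: string;        // timestamp of the mention message itself
  threadTs?: string; // absent when the mention is a top-level message
}

// Build the deterministic session key slack:{team_id}:{channel_id}:{thread_ts}.
function sessionKey(m: SlackMention): string {
  // Assumption: a top-level mention roots its own thread, so its ts is the thread ts.
  const threadTs = m.threadTs ?? m.ts;
  return `slack:${m.teamId}:${m.channelId}:${threadTs}`;
}

const parentKey = sessionKey({ teamId: "T1", channelId: "C9", ts: "1712345678.000100" });
const replyKey = sessionKey({
  teamId: "T1",
  channelId: "C9",
  ts: "1712345999.000200",
  threadTs: "1712345678.000100",
});
console.log(parentKey === replyKey); // both mentions resolve to the same session
```

In a Worker, the resulting string would be passed to the Durable Object namespace's `idFromName`, for example `env.THREAD_DO.idFromName(key)`, where `THREAD_DO` is a hypothetical binding name.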

Mention timestamp is the run id

Store runId = mention_ts under the thread DO. If Slack retries the event, the DO sees the same run id and returns the existing status instead of starting another container.
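The retry behavior can be sketched with an in-memory stand-in for the DO's storage. The class and method names are illustrative assumptions; a real DO would persist the records rather than hold them in a `Map`.

```typescript
type RunStatus = "running" | "complete" | "failed";

interface RunRecord {
  runId: string; // the Slack mention timestamp
  status: RunStatus;
}

// In-memory stand-in for the run records a thread DO would persist.
class ThreadRuns {
  private runs = new Map<string, RunRecord>();

  // Returns the existing run on a retried event instead of starting another one.
  handleMention(runId: string): { started: boolean; run: RunRecord } {
    const existing = this.runs.get(runId);
    if (existing) {
      return { started: false, run: existing };
    }
    const run: RunRecord = { runId, status: "running" };
    this.runs.set(runId, run);
    return { started: true, run };
  }
}

const thread = new ThreadRuns();
const firstDelivery = thread.handleMention("1712345678.000200");
const slackRetry = thread.handleMention("1712345678.000200");
console.log(firstDelivery.started, slackRetry.started);
```

Because a Durable Object processes requests for one thread serially, the check-then-insert above does not need an extra lock in this model.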

One thread can have multiple runs

The DO is per conversation, not per container. A follow-up can resume an active run, answer from completed context, or start a new run against the same branch/worktree when the user asks for more work.

Routing algorithm

  1. Fetch parent and replies for the Slack thread.
  2. Normalize requester, repo, branch, and existing Anvil status messages.
  3. Send the full thread packet to the Think gateway classifier.
  4. Resolve the deterministic thread DO and call handleMention(runId).
  5. The DO either resumes an active run, starts a new run, or calls the Superset fallback tool.
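Step 5 of the routing algorithm reduces to a three-way decision. A minimal sketch, assuming the gateway has already chosen a runner and that duplicate deliveries were filtered earlier by run id; the type and function names are hypothetical.

```typescript
type Runner = "cloudflare" | "superset";
type MentionAction = "resume-active-run" | "start-new-run" | "call-superset-fallback";

interface ThreadSnapshot {
  activeRunId: string | null; // null when no run is in progress
}

// Decide what the thread DO does with a new, non-duplicate mention.
function decideAction(thread: ThreadSnapshot, runner: Runner): MentionAction {
  // Assumption: the runner policy is checked first, so a Superset-routed
  // request never touches a Cloudflare container.
  if (runner === "superset") {
    return "call-superset-fallback";
  }
  return thread.activeRunId !== null ? "resume-active-run" : "start-new-run";
}

console.log(decideAction({ activeRunId: null }, "cloudflare"));
```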

State owned by the DO

Store run summaries, active container id, current branch, artifact manifest, last Slack status timestamp, approval state, cancellation state, and callback tokens. Keep durable source code state in Artifacts/R2, not DO storage.
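One possible shape for that state is sketched below. Every field name and the approval-state values are assumptions drawn from the list above; note that the manifest is stored as a pointer (an object key), consistent with keeping source and binary payloads out of DO storage.

```typescript
interface RunSummary {
  runId: string;
  status: string;
  finishedAt: string | null;
}

// Hypothetical per-thread state; field names are illustrative only.
interface ThreadDoState {
  runSummaries: RunSummary[];
  activeContainerId: string | null;
  currentBranch: string | null;
  artifactManifestKey: string | null; // pointer into Artifacts/R2, not the payload
  lastSlackStatusTs: string | null;
  approvalState: "none" | "pending" | "approved" | "rejected";
  cancellationRequested: boolean;
  callbackTokens: Record<string, string>; // keyed by run id
}

// State for a thread that has not run anything yet.
function emptyThreadState(): ThreadDoState {
  return {
    runSummaries: [],
    activeContainerId: null,
    currentBranch: null,
    artifactManifestKey: null,
    lastSlackStatusTs: null,
    approvalState: "none",
    cancellationRequested: false,
    callbackTokens: {},
  };
}

console.log(emptyThreadState());
```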

Browser Testing And Screenshot Capture

Browser validation belongs in the spike because it is one of the main ways the Cloudflare runner can match or exceed the current Superset developer-machine workflow.

Container-local Playwright

Use this for unmerged Forge changes. Start the app inside the task container, create a local dev auth session, run Playwright against localhost, and capture screenshots/traces before teardown.

Cloudflare Browser Run

Use this for public URLs, task preview URLs, smoke checks, PDFs, and lightweight screenshots. Do not make it the only path for unmerged code unless the container exposes a task-scoped preview.

Artifact contract

The runner emits browser/index.json with viewport, URL, screenshot path, trace path, console summary, network errors, and pass/fail reason so Slack and PR comments can render evidence.
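The contract can be sketched as a TypeScript type with a one-line summary helper for Slack and PR comments. The field names, the summary wording, and the sample entries are assumptions; only the list of recorded facts comes from the plan.

```typescript
interface Viewport { name: string; width: number; height: number; }

// One entry per browser check in browser/index.json (assumed field names).
interface BrowserCheck {
  viewport: Viewport;
  url: string;
  screenshotPath: string;
  tracePath: string;
  consoleSummary: string;
  networkErrors: string[];
  passed: boolean;
  reason: string;
}

interface BrowserIndex { checks: BrowserCheck[]; }

// Render a short evidence line for the final Slack or PR update.
function summarizeBrowserIndex(index: BrowserIndex): string {
  const passed = index.checks.filter((c) => c.passed).length;
  const failures = index.checks
    .filter((c) => !c.passed)
    .map((c) => `${c.viewport.name}: ${c.reason}`);
  const head = `${passed}/${index.checks.length} browser checks passed`;
  return failures.length === 0 ? head : `${head}; failed: ${failures.join(", ")}`;
}

// Made-up sample data for illustration.
const sampleIndex: BrowserIndex = {
  checks: [
    { viewport: { name: "desktop", width: 1440, height: 1000 }, url: "http://localhost:3000/",
      screenshotPath: "/artifacts/browser/desktop.png", tracePath: "/artifacts/browser/desktop-trace.zip",
      consoleSummary: "0 errors", networkErrors: [], passed: true, reason: "smoke test passed" },
    { viewport: { name: "mobile", width: 390, height: 844 }, url: "http://localhost:3000/",
      screenshotPath: "/artifacts/browser/mobile.png", tracePath: "/artifacts/browser/mobile-trace.zip",
      consoleSummary: "1 error", networkErrors: [], passed: false, reason: "login button not visible" },
  ],
};

console.log(summarizeBrowserIndex(sampleIndex));
```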

Default viewports

Start with desktop 1440x1000 and mobile 390x844. Add task-specific viewports only when the agent or user asks. Keep screenshots attached to the final Slack/PR update.

Cloudflare Product Map

Keep the product set small for the spike, but design the interfaces so the fallback pieces can be swapped for Artifacts and Browser Run without changing how the Think gateway talks to the runner.

Workers (Core)

Anvil ingress, Think gateway, policy checks, Slack/GitHub webhooks, tool proxying, and public status endpoints.

Durable Objects (Core)

One actor per Slack thread for serialization, live state, idempotency, callbacks, and container lifecycle.

Containers (Core)

Linux execution environment for Codex, repo tools, package managers, tests, and local browser checks.

Artifacts (Beta / request access)

Git-compatible, versioned file tree per task. Target substrate for forks, diffs, resume, and Git handoff.

R2 (Core)

Immediate fallback for manifests, session archives, screenshots, traces, logs, tarballs, and cache payloads.

Queues (Admission)

Admission buffer between Slack/GitHub events and code-session dispatch. Use with explicit container capacity checks.

Workflows (Orchestration)

Durable prepare, run, verify, PR, wait-for-human, resume, and cleanup orchestration once the DO API is stable.

Browser Run (Browser)

Headless screenshots and browser automation for public/task-preview URLs; optional complement to in-container Playwright.

Secrets Store (Security)

Centralize Slack, GitHub App, Superset, and model credentials; containers only receive scoped tokens.

Implementation Spike

The spike is complete when Anvil can take a Slack thread request, produce a verified code diff in Cloudflare, and post evidence back to the same thread without creating a Superset workspace.

Step 1

Thread gateway and runner policy

Add runner = cloudflare | superset, deterministic thread DO routing, Slack thread fetch before planning, and idempotency by mention timestamp. Default to Cloudflare only for an allowlisted repo/channel/requester set.
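The default-runner rule can be sketched as a pure function. This assumes that repo, channel, and requester must all be allowlisted for Cloudflare to be the default; the plan does not say whether the three checks are combined with AND or OR, and the names and sample values below are hypothetical.

```typescript
type Runner = "cloudflare" | "superset";

interface RunnerAllowlist {
  repos: string[];
  channels: string[];
  requesters: string[];
}

interface MentionContext {
  repo: string;
  channelId: string;
  requesterId: string;
}

// Default to Cloudflare only when repo, channel, and requester are all allowlisted.
function defaultRunner(ctx: MentionContext, allow: RunnerAllowlist): Runner {
  const allowed =
    allow.repos.includes(ctx.repo) &&
    allow.channels.includes(ctx.channelId) &&
    allow.requesters.includes(ctx.requesterId);
  return allowed ? "cloudflare" : "superset";
}

// Made-up sample allowlist for illustration.
const spikeAllowlist: RunnerAllowlist = {
  repos: ["example-org/forge"],
  channels: ["C9"],
  requesters: ["U1"],
};

console.log(defaultRunner({ repo: "example-org/forge", channelId: "C9", requesterId: "U1" }, spikeAllowlist));
```

Treating Superset as the result of any failed check keeps the spike's blast radius limited to the allowlisted set.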

Step 2

Container image and no-op run

Build the initial image, start it from the DO, stream logs, prove callbacks, run node --version and pnpm --version, and archive a tiny artifact bundle.

Step 3

Forge clone and Codex diff

Use GitHub clone plus R2 archive first. Run Codex on a safe branch-scoped change, capture the diff, and post a Slack update with the changed files and verification status.

Step 4

Browser validation lane

Start Forge in the container with local agent auth, run one Playwright smoke test, capture desktop and mobile screenshots, and publish signed links in the final Slack reply.

Step 5

PR-capable MVP

Push a branch and open a PR through Worker-mediated GitHub tools. Keep Superset as an explicit tool for Mac-only, GUI-heavy, Docker-heavy, or over-budget sessions.

Step 6

Artifact-backed base refresh

Add the post-dev-deploy GitHub Actions base refresh. Switch task worktrees from GitHub clone to Artifacts fork once account access and performance are validated.

Acceptance Criteria

Thread resume works.

A second mention in the same Slack thread reaches the same DO, sees prior run state, and either resumes or starts a new run with the existing branch context.

Verification evidence is visible.

Final Slack and PR updates include command results, screenshots when browser validation ran, and links to logs, traces, and artifact manifests.

Secrets stay outside the worktree.

Containers can call back to Worker-owned tools, but cannot read long-lived Slack, GitHub App, model, database, or Superset credentials.

Known Risks

Artifacts access and performance are the main unknowns.

Treat Artifacts as the target substrate, not the first hard dependency. The spike should ship with GitHub clone plus R2 archive, then swap in Artifacts when access, repo import/fork, token flows, and clone/mount speed are validated.

Cold starts can erase the win.

Browser binaries, package managers, and OS dependencies must be in the image. Measure clone/install time before adding complexity; add Artifacts/ArtifactFS and R2 caches based on data.

Not every task should leave Superset.

Keep the fallback for Mac-only tools, deep GUI handoff, long-running interactive debugging, Docker-in-Docker, and sessions that exceed Cloudflare container limits.

Source Notes