Implementation plan
Run Anvil code changes in Cloudflare Containers by default.
The spike should prove a complete Slack-thread-to-code-runner loop: thread-scoped Durable Object, Forge worktree, Codex execution, tests, browser screenshots, artifact capture, and Slack/GitHub updates, with Superset preserved as an explicit fallback tool.
Target Architecture
The Think harness becomes a gateway. It reads the Slack thread, picks the runner policy, creates or resumes the thread Durable Object, and then gives the container only scoped callbacks and short-lived repo credentials.
- The gateway fetches the parent message and all replies before planning, and uses the mention timestamp as the idempotency key.
- It classifies the request, chooses Cloudflare or Superset, and enforces actor/repo policy.
- The thread Durable Object owns the event journal, run state, WebSocket/status stream, and container lifecycle.
- Each task gets one isolated Git-compatible repo; the fallback is a GitHub clone plus an R2 archive.
- The container image provides Node 22, pnpm, Codex, Git, Forge dependencies, Playwright, and screenshot capture.
- Outputs go through Worker-mediated writes: signed artifact links, screenshots, traces, and final evidence.
Decisions For The Spike
These are the decisions the spike should validate, not leave as open-ended architecture questions.
Use one Durable Object per Slack thread.
The DO identity is derived deterministically from
slack:{team_id}:{channel_id}:{thread_ts}. A new
mention in the same thread resumes the same DO, which can start a
new run if the prior run is complete.
Snapshot main after successful dev deploys.
Add an async GitHub Actions step that runs after the dev deploy succeeds. It updates the canonical Forge base in Artifacts when available, and writes a fallback manifest plus tar/cache artifacts to R2.
Make browser validation first-class.
The initial image includes Playwright/Chromium dependencies. The
runner exposes browser_test and
capture_screenshot tools that publish screenshots,
traces, and logs as artifacts.
Initial Container Image
Start from the existing Forge runtime pattern instead of a generic Ubuntu image: DB-free runtime, Worker-owned secrets, and a narrow callback channel. The image should be stable enough to avoid installing browsers and package managers during each task.
Base and runtimes
Use the Cloudflare sandbox/container base or a compatible Linux base with Cloudflare container support. Install Node 22.x, pnpm 10.28.1, Git, OpenSSH client, curl, jq, ripgrep, Python 3, build-essential tooling, and CA certificates.
Agent and repo tooling
Bake in the Anvil code-runner entrypoint, Codex runtime
adapter, MCP client shims, GitHub CLI only for nonprivileged
inspection, and the same workspace/archive helpers used by
containers/agent-runtime-server.
Browser dependencies
Install Playwright and Chromium during image build with a
fixed browser path. Do not download browser binaries during
task execution. Store traces, videos, screenshots, console
logs, and network logs under /artifacts/browser.
What not to bake in
Do not copy the Forge repo into the image as the source of truth, and do not bake in Slack, GitHub App, model, database, or Superset credentials. The repo arrives per task and secrets stay Worker-side or short-lived.
Sketch Dockerfile
FROM docker.io/cloudflare/sandbox:0.12.1

# Note: with NODE_ENV=production, pnpm install skips devDependencies by default.
# Task runs that need test tooling may have to override this per command.
ENV NODE_ENV=production \
    PLAYWRIGHT_BROWSERS_PATH=/ms-playwright \
    PNPM_HOME=/usr/local/bin

RUN apt-get update && apt-get install -y --no-install-recommends \
      bash ca-certificates curl git openssh-client jq unzip xz-utils \
      ripgrep fd-find python3 make g++ pkg-config \
    && rm -rf /var/lib/apt/lists/*

RUN curl -fsSL \
      https://github.com/pnpm/pnpm/releases/download/v10.28.1/pnpm-linux-x64 \
      -o /usr/local/bin/pnpm \
    && chmod +x /usr/local/bin/pnpm

RUN mkdir -p /worktrees /artifacts /cache/pnpm /root/.codex

# Install browser dependencies at build time, not per task.
# Consider pinning the Playwright version here so the browser install is reproducible.
RUN pnpm dlx playwright install --with-deps chromium

# Copy only the runner server/adapter, not the Forge worktree.
COPY containers/anvil-code-runner/dist/ /runner/
COPY containers/agent-runtime-server/dist/ /runner/agent-runtime-server/

WORKDIR /worktrees
EXPOSE 8080
ENTRYPOINT ["/runner/anvil-code-runner"]
The exact entrypoint can change, but the constraints should not: reproducible browser install, no source repo in the image, and no privileged long-lived secrets in the container.
Repo Loading And Snapshot Plan
The long-term fast path is Artifacts. The spike should still work before Artifacts access by cloning GitHub and archiving state to R2.
After every successful dev deploy, refresh the canonical base.
Add a nonblocking GitHub Actions job that runs after the dev
Cloudflare deploy succeeds. It records the deployed commit SHA,
image digest, pnpm lock hash, generated client/schema hash, and
timestamp in a signed manifest. If Artifacts is available, it
imports or fast-forwards anvil/forge-main; otherwise
it writes the manifest and warm tar/cache payloads to R2.
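A minimal sketch of what the signed manifest could look like. The plan does not say how the manifest is signed, so HMAC-SHA256 with a shared secret is an assumption here, and the field names, `signManifest`, and `verifyManifest` are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Fields listed in the plan; the property names are illustrative.
interface BaseManifest {
  commitSha: string;
  imageDigest: string;
  pnpmLockHash: string;
  generatedSchemaHash: string;
  timestamp: string; // ISO 8601
}

// Serialize in a fixed field order so the signature does not depend on
// how the object happened to be constructed.
function canonicalize(m: BaseManifest): string {
  return JSON.stringify([
    m.commitSha,
    m.imageDigest,
    m.pnpmLockHash,
    m.generatedSchemaHash,
    m.timestamp,
  ]);
}

// Assumption: HMAC-SHA256 over the canonical form, hex-encoded.
function signManifest(m: BaseManifest, secret: string): string {
  return createHmac("sha256", secret).update(canonicalize(m)).digest("hex");
}

// Constant-time comparison; a malformed signature simply fails the length check.
function verifyManifest(m: BaseManifest, signature: string, secret: string): boolean {
  const expected = Buffer.from(signManifest(m, secret), "hex");
  const actual = Buffer.from(signature, "hex");
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

The GitHub Actions job would call something like `signManifest` before uploading, and the gateway would call `verifyManifest` before trusting a base snapshot.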
Create an isolated task repo/worktree.
The thread DO asks the gateway for a repo credential scoped to
the task. Preferred path: fork/copy the Artifacts base repo into
anvil/task-{thread_hash}-{mention_ts} and give the
container a repo-scoped read/write token. Fallback path: shallow
or blobless clone from GitHub using a short-lived GitHub App
token.
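One way the task repo name could be derived, as a sketch. The plan does not define `thread_hash`; taking the first 12 hex characters of a SHA-256 over the session key is an assumption, and `threadHash` and `taskRepoName` are hypothetical helper names.

```typescript
import { createHash } from "node:crypto";

// Assumption: thread_hash is a truncated SHA-256 of the thread session key.
function threadHash(sessionKey: string): string {
  return createHash("sha256").update(sessionKey).digest("hex").slice(0, 12);
}

// Fills the anvil/task-{thread_hash}-{mention_ts} pattern from the plan.
// mention_ts is used verbatim, including the dot in Slack timestamps.
function taskRepoName(sessionKey: string, mentionTs: string): string {
  return `anvil/task-${threadHash(sessionKey)}-${mentionTs}`;
}
```

Because both inputs are deterministic, a retried event maps to the same repo name rather than creating a second task repo.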
Separate source freshness from dependency warmup.
The image owns OS/browser/tool dependencies. R2 owns reusable
caches keyed by lockfile and platform. The task still runs
pnpm install --frozen-lockfile, but it should hit
pnpm store/cache rather than the public network whenever
possible.
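A sketch of a cache key built from the lockfile and platform. The `cache/pnpm/...` object layout and the `pnpmCacheKey` helper are assumptions for illustration, not an existing convention in the repo.

```typescript
import { createHash } from "node:crypto";

// Derives an R2 object key from the lockfile contents and the target platform.
// Any change to pnpm-lock.yaml yields a new key, so stale caches are never reused.
function pnpmCacheKey(lockfileContents: string, platform: string, arch: string): string {
  const lockHash = createHash("sha256").update(lockfileContents).digest("hex");
  return `cache/pnpm/${platform}-${arch}/${lockHash}.tar`;
}
```

A runner would look up this key before `pnpm install --frozen-lockfile` and restore the store if the object exists; on a miss it installs from the network and uploads the result under the same key.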
Persist every terminal run.
On success, failure, cancellation, or timeout, save the event journal, diff, workspace metadata, screenshots, traces, and verification logs. Artifacts is the code-state archive; R2 is still the binary/log archive.
Slack Thread To Durable Object Scope
The agent should not guess from partial context. Every Slack mention should fetch the current parent and replies first, then route through a deterministic thread DO.
Deterministic session key
Compute sessionKey =
slack:{team_id}:{channel_id}:{thread_ts}. Use that as
the Durable Object name. This removes the need for a
database lookup just to decide whether a session exists.
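The key construction can be sketched as a small pure function. The `SlackThreadRef` shape and the colon check are illustrative assumptions; only the key format comes from the plan.

```typescript
// Hypothetical shape for the three fields the session key uses.
interface SlackThreadRef {
  teamId: string;
  channelId: string;
  threadTs: string; // Slack thread timestamp, e.g. "1712345678.000100"
}

// Builds slack:{team_id}:{channel_id}:{thread_ts}.
// Rejects empty parts and parts containing ":" so two different threads
// can never collapse into the same key.
function sessionKey(ref: SlackThreadRef): string {
  const parts = [ref.teamId, ref.channelId, ref.threadTs];
  for (const part of parts) {
    if (part.length === 0 || part.includes(":")) {
      throw new Error(`invalid session key part: ${JSON.stringify(part)}`);
    }
  }
  return `slack:${parts.join(":")}`;
}
```

In a Worker, the resulting string would typically be passed to `idFromName` on the Durable Object namespace binding, so the same thread always resolves to the same object.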
Mention timestamp is the run id
Store runId = mention_ts under the thread DO.
If Slack retries the event, the DO sees the same run id and
returns the existing status instead of starting another
container.
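A minimal sketch of the retry behavior, using an in-memory `Map` as a stand-in for Durable Object storage. `ThreadRuns`, `RunRecord`, and the status names are hypothetical; the real DO would persist records and start a container where this sketch only records the run.

```typescript
type RunStatus = "running" | "succeeded" | "failed" | "cancelled" | "timed_out";

interface RunRecord {
  runId: string; // equal to the Slack mention timestamp
  status: RunStatus;
}

// In-memory stand-in for the thread DO's run table (assumption: the real
// implementation would use Durable Object storage instead of a Map).
class ThreadRuns {
  private runs = new Map<string, RunRecord>();

  // Returns the existing record for a retried event, or creates a new run.
  handleMention(mentionTs: string): { record: RunRecord; started: boolean } {
    const existing = this.runs.get(mentionTs);
    if (existing) {
      return { record: existing, started: false };
    }
    const record: RunRecord = { runId: mentionTs, status: "running" };
    this.runs.set(mentionTs, record);
    return { record, started: true };
  }
}
```

Because a Durable Object processes requests for a given name one at a time by default, the check-then-insert above does not race with a concurrent retry of the same event.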
One thread can have multiple runs
The DO is per conversation, not per container. A follow-up can resume an active run, answer from completed context, or start a new run against the same branch/worktree when the user asks for more work.
Routing algorithm
- Fetch parent and replies for the Slack thread.
- Normalize requester, repo, branch, and existing Anvil status messages.
- Send the full thread packet to the Think gateway classifier.
- Resolve the deterministic thread DO and call handleMention(runId).
- The DO either resumes an active run, starts a new run, or calls the Superset fallback tool.
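The final decision step of the routing algorithm can be sketched as a pure function. The `ThreadState` shape, the action names, and the order of the checks are assumptions; the plan only names the possible outcomes.

```typescript
type Runner = "cloudflare" | "superset";
type Action = "return_existing" | "superset_fallback" | "resume_active" | "start_run";

// Hypothetical minimal view of what the thread DO knows when a mention arrives.
interface ThreadState {
  activeRunId: string | null;
  knownRunIds: string[];
}

function routeMention(state: ThreadState, runId: string, runner: Runner): Action {
  // A Slack retry carries a run id the DO has already seen.
  if (state.knownRunIds.includes(runId)) return "return_existing";
  // The gateway classifier already chose the runner for this request.
  if (runner === "superset") return "superset_fallback";
  // A follow-up while a run is still active joins that run.
  if (state.activeRunId !== null) return "resume_active";
  return "start_run";
}
```

Keeping this step free of I/O makes it straightforward to unit test every branch before wiring it into the Durable Object.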
State owned by the DO
Store run summaries, active container id, current branch, artifact manifest, last Slack status timestamp, approval state, cancellation state, and callback tokens. Keep durable source code state in Artifacts/R2, not DO storage.
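The state listed above could be typed roughly as follows. All property names and the approval values are illustrative assumptions; the point is that the record stays small and JSON-serializable, with source code kept elsewhere.

```typescript
interface RunSummary {
  runId: string;
  status: string;
  finishedAt?: string; // ISO 8601, absent while the run is active
}

// Hypothetical shape of the per-thread state held by the Durable Object.
interface ThreadDoState {
  runs: RunSummary[];
  activeContainerId: string | null;
  currentBranch: string | null;
  artifactManifest: Record<string, string>; // artifact name -> archive location
  lastSlackStatusTs: string | null;
  approval: "not_required" | "pending" | "approved" | "rejected";
  cancellationRequested: boolean;
  callbackTokens: Record<string, { token: string; expiresAt: number }>;
}

// Initial state for a thread that has not run anything yet.
function emptyThreadState(): ThreadDoState {
  return {
    runs: [],
    activeContainerId: null,
    currentBranch: null,
    artifactManifest: {},
    lastSlackStatusTs: null,
    approval: "not_required",
    cancellationRequested: false,
    callbackTokens: {},
  };
}
```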
Browser Testing And Screenshot Capture
Browser validation belongs in the spike because it is one of the main ways the Cloudflare runner can match or exceed the current Superset developer-machine workflow.
Container-local Playwright
Use this for unmerged Forge changes. Start the app inside the task container, create a local dev auth session, run Playwright against localhost, and capture screenshots/traces before teardown.
Cloudflare Browser Run
Use this for public URLs, task preview URLs, smoke checks, PDFs, and lightweight screenshots. Do not make it the only path for unmerged code unless the container exposes a task-scoped preview.
Artifact contract
The runner emits browser/index.json with viewport,
URL, screenshot path, trace path, console summary, network errors,
and pass/fail reason so Slack and PR comments can render evidence.
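A sketch of one possible shape for entries in browser/index.json, plus a renderer that turns them into plain-text lines for a Slack or PR update. The property names and the `summarize` helper are assumptions; the plan only lists which facts each entry carries.

```typescript
// One entry per captured page/viewport combination.
interface BrowserCheck {
  viewport: { name: string; width: number; height: number };
  url: string;
  screenshotPath: string;
  tracePath: string;
  consoleSummary: { errors: number; warnings: number };
  networkErrors: string[];
  passed: boolean;
  reason: string;
}

// Renders one line per check, e.g. for the final Slack reply.
function summarize(checks: BrowserCheck[]): string {
  return checks
    .map((c) => {
      const status = c.passed ? "PASS" : "FAIL";
      const size = `${c.viewport.width}x${c.viewport.height}`;
      return `${status} ${c.viewport.name} ${size} ${c.url}: ${c.reason}`;
    })
    .join("\n");
}
```

Keeping paths relative to the artifact root would let the Worker swap in signed links when it posts the evidence.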
Default viewports
Start with desktop 1440x1000 and mobile 390x844. Add task-specific viewports only when the agent or user asks. Keep screenshots attached to the final Slack/PR update.
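The default viewports and the opt-in extension can be sketched as data plus a small merge function. The `Viewport` type and `resolveViewports` are hypothetical names; the two default sizes come from the text above.

```typescript
interface Viewport {
  name: string;
  width: number;
  height: number;
}

const DEFAULT_VIEWPORTS: readonly Viewport[] = [
  { name: "desktop", width: 1440, height: 1000 },
  { name: "mobile", width: 390, height: 844 },
];

// Returns the defaults plus any task-specific viewports.
// A task viewport with the same name as a default replaces that default.
function resolveViewports(extra: Viewport[] = []): Viewport[] {
  const byName = new Map<string, Viewport>();
  for (const v of [...DEFAULT_VIEWPORTS, ...extra]) {
    byName.set(v.name, v);
  }
  return Array.from(byName.values());
}
```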
Cloudflare Product Map
Keep the product set small for the spike, but design the interfaces so the fallback pieces can be swapped for Artifacts and Browser Run without changing how the Think gateway talks to the runner.
- Anvil ingress, Think gateway, policy checks, Slack/GitHub webhooks, tool proxying, and public status endpoints. [Core]
- One actor per Slack thread for serialization, live state, idempotency, callbacks, and container lifecycle. [Core]
- Linux execution environment for Codex, repo tools, package managers, tests, and local browser checks. [Core]
- Git-compatible, versioned file tree per task. Target substrate for forks, diffs, resume, and Git handoff. [Beta / request access]
- Immediate fallback for manifests, session archives, screenshots, traces, logs, tarballs, and cache payloads. [Core]
- Admission buffer between Slack/GitHub events and code-session dispatch. Use with explicit container capacity checks. [Admission]
- Durable prepare, run, verify, PR, wait-for-human, resume, and cleanup orchestration once the DO API is stable. [Orchestration]
- Headless screenshots and browser automation for public/task-preview URLs; optional complement to in-container Playwright. [Browser]
- Centralize Slack, GitHub App, Superset, and model credentials; containers only receive scoped tokens. [Security]
Implementation Spike
The spike is complete when Anvil can take a Slack thread request, produce a verified code diff in Cloudflare, and post evidence back to the same thread without creating a Superset workspace.
Thread gateway and runner policy
Add runner = cloudflare | superset, deterministic
thread DO routing, Slack thread fetch before planning, and
idempotency by mention timestamp. Default Cloudflare only for an
allowlisted repo/channel/requester set.
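A sketch of the allowlist gate. It assumes that all three of repo, channel, and requester must be allowlisted before Cloudflare is chosen, and that everything else falls back to Superset; the type and function names are hypothetical.

```typescript
type Runner = "cloudflare" | "superset";

interface Allowlist {
  repos: string[];
  channels: string[];
  requesters: string[];
}

interface MentionContext {
  repo: string;
  channelId: string;
  requesterId: string;
}

// Assumption: Cloudflare is the default only when repo, channel, and
// requester are all allowlisted; any miss keeps the request on Superset.
function chooseRunner(ctx: MentionContext, allow: Allowlist): Runner {
  const allowed =
    allow.repos.includes(ctx.repo) &&
    allow.channels.includes(ctx.channelId) &&
    allow.requesters.includes(ctx.requesterId);
  return allowed ? "cloudflare" : "superset";
}
```

Starting with a strict conjunction keeps the spike's blast radius small; the allowlist can be widened as runs prove stable.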
Container image and no-op run
Build the initial image, start it from the DO, stream logs, prove
callbacks, run node --version,
pnpm --version, and archive a tiny artifact bundle.
Forge clone and Codex diff
Use GitHub clone plus R2 archive first. Run Codex on a safe branch-scoped change, capture the diff, and post a Slack update with the changed files and verification status.
Browser validation lane
Start Forge in the container with local agent auth, run one Playwright smoke test, capture desktop and mobile screenshots, and publish signed links in the final Slack reply.
PR-capable MVP
Push a branch and open a PR through Worker-mediated GitHub tools. Keep Superset as an explicit tool for Mac-only, GUI-heavy, Docker-heavy, or over-budget sessions.
Artifact-backed base refresh
Add the post-dev-deploy GitHub Actions base refresh. Switch task worktrees from GitHub clone to Artifacts fork once account access and performance are validated.
Acceptance Criteria
Thread resume works.
A second mention in the same Slack thread reaches the same DO, sees prior run state, and either resumes or starts a new run with the existing branch context.
Verification evidence is visible.
Final Slack and PR updates include command results, screenshots when browser validation ran, and links to logs, traces, and artifact manifests.
Secrets stay outside the worktree.
Containers can call back to Worker-owned tools, but cannot read long-lived Slack, GitHub App, model, database, or Superset credentials.
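One way to enforce this criterion is to build the container environment from an explicit allowlist rather than filtering out known secrets. This is a sketch; every variable name below is hypothetical and not taken from the Forge or Anvil code.

```typescript
// Hypothetical names for the only values a task container should receive:
// run identity, a scoped callback channel, and a short-lived repo credential.
const CONTAINER_ENV_ALLOWLIST: readonly string[] = [
  "RUN_ID",
  "CALLBACK_URL",
  "CALLBACK_TOKEN",
  "REPO_URL",
  "REPO_TOKEN",
];

// Copies only allowlisted keys, so a long-lived secret added to the Worker
// environment later is excluded by default instead of leaking by default.
function buildContainerEnv(source: Record<string, string | undefined>): Record<string, string> {
  const env: Record<string, string> = {};
  for (const name of CONTAINER_ENV_ALLOWLIST) {
    const value = source[name];
    if (value !== undefined) {
      env[name] = value;
    }
  }
  return env;
}
```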
Known Risks
Artifacts access and performance are the main unknowns.
Treat Artifacts as the target substrate, not the first hard dependency. The spike should ship with GitHub clone plus R2 archive, then swap in Artifacts when access, repo import/fork, token flows, and clone/mount speed are validated.
Cold starts can erase the win.
Browser binaries, package managers, and OS dependencies must be in the image. Measure clone/install time before adding complexity; add Artifacts/ArtifactFS and R2 caches based on data.
Not every task should leave Superset.
Keep the fallback for Mac-only tools, deep GUI handoff, long-running interactive debugging, Docker-in-Docker, and sessions that exceed Cloudflare container limits.
Source Notes
- Cloudflare Containers: build and push container images, call containers from Workers, and use Durable Object-backed lifecycle control.
- Cloudflare Artifacts: Git-compatible, versioned file trees with Workers, REST, and Git client access; repo-scoped tokens make it a strong fit for per-task agent worktrees.
- Cloudflare Browser Run: headless browser automation, screenshots, PDFs, and Playwright control from Workers.
- Cloudflare GitHub Actions and Git integration: use post-deploy automation to refresh the canonical Forge base after dev deploys succeed.
- Forge local facts checked in this workspace: root and Forge package engines require Node 22.x, root package manager is pnpm 10.28.1, and the current agent-runtime Dockerfile uses a DB-free runtime image with Worker-owned persistence and callbacks.