Implementation plan
Run Anvil code changes in Cloudflare Containers by default.
The spike should prove a complete Slack thread to code runner loop: thread-scoped Durable Object, Forge worktree, Codex execution, tests, browser screenshots, artifact capture, and Slack/GitHub updates, with Superset preserved as an explicit fallback tool.
Target Architecture
The Think harness becomes a gateway. It reads the Slack thread, picks the runner policy, creates or resumes the thread Durable Object, and then gives the container only scoped callbacks and short-lived repo credentials.
- Fetch the thread parent and replies before planning; use the mention timestamp as the idempotency key.
- Classify the request, choose Cloudflare or Superset, and enforce actor/repo policy.
- The thread Durable Object owns the event journal, run state, WebSocket/status stream, and container lifecycle.
- Use one isolated Git-compatible repo per task; fall back to a GitHub clone plus R2 archive.
- The container image provides Node 22, pnpm, Codex, Git, Forge deps, Playwright, and screenshot capture.
- Reporting uses Worker-mediated writes, signed artifact links, screenshots, traces, and final evidence.
Decisions For The Spike
These are the decisions the spike should validate, not leave as open-ended architecture questions.
Use one Durable Object per Slack thread.
The DO identity is deterministic from
slack:{team_id}:{channel_id}:{thread_ts}. A new
mention in the same thread resumes the same DO, which can start a
new run if the prior run is complete.
Snapshot main after successful dev deploys.
Add an async GitHub Actions step that runs after the dev deploy succeeds. It updates the canonical Forge base in Artifacts when available, and writes a fallback manifest plus tar/cache artifacts to R2.
Make browser validation first-class.
The initial image includes Playwright/Chromium dependencies. The
runner exposes browser_test and
capture_screenshot tools that publish screenshots,
traces, and logs as artifacts.
Gate the tool before the agent sees it.
The first pass should not rely on prompt discipline. The Think
gateway should expose the Cloudflare container tool only when a
global flag is on and the Slack thread parent was posted by an
allowlisted user, initially U0A5KQ7U3N3.
Initial Container Image
Start from the existing Forge runtime pattern instead of a generic Ubuntu image: DB-free runtime, Worker-owned secrets, and a narrow callback channel. The image should be stable enough to avoid installing browsers and package managers during each task.
Base and runtimes
Use the Cloudflare sandbox/container base or a compatible Linux base with Cloudflare container support. Install Node 22.x, pnpm 10.28.1, Git, OpenSSH client, curl, jq, ripgrep, Python 3, build-essential tooling, and CA certificates.
Agent and repo tooling
Bake in the Anvil code-runner entrypoint, Codex runtime
adapter, MCP client shims, GitHub CLI only for nonprivileged
inspection, and the same workspace/archive helpers used by
containers/agent-runtime-server.
Browser dependencies
Install Playwright and Chromium during image build with a
fixed browser path. Do not download browser binaries during
task execution. Store traces, videos, screenshots, console
logs, and network logs under /artifacts/browser.
What not to bake in
Do not copy the Forge repo into the image as the source of truth, and do not bake in Slack, GitHub App, model, database, or Superset credentials. The repo arrives per task and secrets stay Worker-side or short-lived.
Sketch Dockerfile
FROM docker.io/cloudflare/sandbox:0.12.1
ENV NODE_ENV=production \
    PLAYWRIGHT_BROWSERS_PATH=/ms-playwright \
    PNPM_HOME=/usr/local/bin
RUN apt-get update && apt-get install -y --no-install-recommends \
      bash ca-certificates curl git openssh-client jq unzip xz-utils \
      ripgrep fd-find python3 make g++ pkg-config \
    && rm -rf /var/lib/apt/lists/*
RUN curl -fsSL \
      https://github.com/pnpm/pnpm/releases/download/v10.28.1/pnpm-linux-x64 \
      -o /usr/local/bin/pnpm \
    && chmod +x /usr/local/bin/pnpm
RUN mkdir -p /worktrees /artifacts /cache/pnpm /root/.codex
# Install browser dependencies at build time, not per task.
RUN pnpm dlx playwright install --with-deps chromium
# Copy only the runner server/adapter, not the Forge worktree.
COPY containers/anvil-code-runner/dist/ /runner/
COPY containers/agent-runtime-server/dist/ /runner/agent-runtime-server/
WORKDIR /worktrees
EXPOSE 8080
ENTRYPOINT ["/runner/anvil-code-runner"]
The exact entrypoint can change, but the constraints should not: reproducible browser install, no source repo in the image, and no privileged long-lived secrets in the container.
Repo Loading And Snapshot Plan
The long-term fast path is Artifacts. The spike should still work before Artifacts access by cloning GitHub and archiving state to R2.
After every successful dev deploy, refresh the canonical base.
Add a nonblocking GitHub Actions job that runs after the dev
Cloudflare deploy succeeds. It records the deployed commit SHA,
image digest, pnpm lock hash, generated client/schema hash, and
timestamp in a signed manifest. If Artifacts is available, it
imports or fast-forwards anvil/forge-main; otherwise
it writes the manifest and warm tar/cache payloads to R2.
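The branch described above can be sketched as a pure planning function. This is only an illustration: the manifest field names, the `forge-base/` R2 key layout, and the `RefreshAction` type are assumptions, not part of the plan, and the signing step is left out because the plan does not name a scheme.

```typescript
// Hypothetical manifest shape; the plan lists the facts to record, not a schema.
type BaseManifest = {
  commitSha: string;
  imageDigest: string;
  pnpmLockHash: string;
  generatedSchemaHash: string;
  deployedAt: string; // ISO-8601 timestamp
};

type RefreshAction =
  | { kind: 'artifacts-fast-forward'; repo: string; commitSha: string }
  | { kind: 'r2-put'; key: string; description: string };

// Decide what the post-deploy job should write, without performing any I/O.
export function planBaseRefresh(
  manifest: BaseManifest,
  artifactsAvailable: boolean,
): RefreshAction[] {
  if (artifactsAvailable) {
    return [
      { kind: 'artifacts-fast-forward', repo: 'anvil/forge-main', commitSha: manifest.commitSha },
    ];
  }
  // Assumed R2 layout: one prefix per deployed commit.
  const prefix = `forge-base/${manifest.commitSha}`;
  return [
    { kind: 'r2-put', key: `${prefix}/manifest.json`, description: 'signed manifest' },
    { kind: 'r2-put', key: `${prefix}/worktree.tar`, description: 'warm source tarball' },
    { kind: 'r2-put', key: `${prefix}/cache.tar`, description: 'warm cache payload' },
  ];
}
```

Keeping the decision separate from the upload calls makes the job easy to unit test and lets the Artifacts path replace the R2 path without touching callers.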
Create an isolated task repo/worktree.
The thread DO asks the gateway for a repo credential scoped to
the task. Preferred path: fork/copy the Artifacts base repo into
anvil/task-{thread_hash}-{mention_ts} and give the
container a repo-scoped read/write token. Fallback path: shallow
or blobless clone from GitHub using a short-lived GitHub App
token.
Separate source freshness from dependency warmup.
The image owns OS/browser/tool dependencies. R2 owns reusable
caches keyed by lockfile and platform. The task still runs
pnpm install --frozen-lockfile, but it should hit
pnpm store/cache rather than the public network whenever
possible.
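A minimal sketch of a cache key derived from the lockfile hash and platform. The `cache/pnpm-store/` prefix, the `.tar` suffix, and the `CachePlatform` type are illustrative assumptions.

```typescript
// Hypothetical platform descriptor, e.g. { os: 'linux', arch: 'x64' }.
type CachePlatform = { os: string; arch: string };

// Build the R2 object key for a warm pnpm store.
// A changed lockfile or platform yields a different key, so stale caches are never reused.
export function pnpmCacheKey(lockfileHash: string, platform: CachePlatform): string {
  if (!/^[0-9a-f]{8,}$/i.test(lockfileHash)) {
    throw new Error(`lockfile hash must be at least 8 hex characters, got "${lockfileHash}"`);
  }
  return `cache/pnpm-store/${platform.os}-${platform.arch}/${lockfileHash.toLowerCase()}.tar`;
}
```

On a cache miss the task would still run pnpm install --frozen-lockfile against the network and could upload the resulting store under the same key.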
Persist every terminal run.
On success, failure, cancellation, or timeout, save the event journal, diff, workspace metadata, screenshots, traces, and verification logs. Artifacts is the code-state archive; R2 is still the binary/log archive.
Slack Thread To Durable Object Scope
The agent should not guess from partial context. Every Slack mention should fetch the current parent and replies first, then route through a deterministic thread DO.
Deterministic session key
Compute sessionKey =
slack:{team_id}:{channel_id}:{thread_ts}. Use that as
the Durable Object name. This removes the need for a
database lookup just to decide whether a session exists.
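As a sketch, the key can be built with a small pure helper. The `SlackThreadRef` type and the validation rules are assumptions. In a Worker the result would typically be passed to the Durable Object namespace's `idFromName`, with a binding name of your choosing.

```typescript
// Hypothetical input shape; values come from the Slack event payload.
type SlackThreadRef = { teamId: string; channelId: string; threadTs: string };

// Build the deterministic session key used as the Durable Object name.
export function buildSessionKey(ref: SlackThreadRef): string {
  const parts = [ref.teamId, ref.channelId, ref.threadTs];
  for (const part of parts) {
    // Reject empty parts and stray separators so two threads can never collide.
    if (part.length === 0 || part.includes(':')) {
      throw new Error(`invalid session key part: "${part}"`);
    }
  }
  return `slack:${parts.join(':')}`;
}
```

The same inputs always produce the same key, so a retry or a follow-up mention resolves to the same Durable Object without a lookup.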
Mention timestamp is the run id
Store runId = mention_ts under the thread DO.
If Slack retries the event, the DO sees the same run id and
returns the existing status instead of starting another
container.
One thread can have multiple runs
The DO is per conversation, not per container. A follow-up can resume an active run, answer from completed context, or start a new run against the same branch/worktree when the user asks for more work.
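The idempotency and multi-run rules above can be sketched with an in-memory stand-in for the thread DO. The class name, status values, and outcome kinds are hypothetical. A real Durable Object would persist this state in DO storage, and the "answer from completed context" branch is not modeled here.

```typescript
type RunStatus = 'running' | 'succeeded' | 'failed' | 'cancelled' | 'timed_out';
type Run = { runId: string; status: RunStatus };
type MentionOutcome = { kind: 'duplicate' | 'resumed' | 'started'; run: Run };

export class ThreadSession {
  private runs = new Map<string, Run>();
  private activeRunId: string | null = null;

  // mentionTs doubles as the run id, so a Slack retry maps to the same run.
  handleMention(mentionTs: string): MentionOutcome {
    const existing = this.runs.get(mentionTs);
    if (existing) return { kind: 'duplicate', run: existing };

    // A new mention while a run is active resumes that run instead of starting a container.
    const active = this.activeRunId ? this.runs.get(this.activeRunId) : undefined;
    if (active && active.status === 'running') return { kind: 'resumed', run: active };

    const run: Run = { runId: mentionTs, status: 'running' };
    this.runs.set(mentionTs, run);
    this.activeRunId = mentionTs;
    return { kind: 'started', run };
  }

  // Record a terminal status and free the thread for the next run.
  finish(runId: string, status: Exclude<RunStatus, 'running'>): void {
    const run = this.runs.get(runId);
    if (!run) throw new Error(`unknown run ${runId}`);
    run.status = status;
    if (this.activeRunId === runId) this.activeRunId = null;
  }
}
```

For example, two deliveries of the same mention return `started` and then `duplicate`, while a later mention after `finish` starts a second run in the same session.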
Routing algorithm
- Fetch parent and replies for the Slack thread.
- Normalize requester, repo, branch, and existing Anvil status messages.
- Send the full thread packet to the Think gateway classifier.
- Resolve the deterministic thread DO and call handleMention(runId).
- The DO either resumes an active run, starts a new run, or calls the Superset fallback tool.
State owned by the DO
Store run summaries, active container id, current branch, artifact manifest, last Slack status timestamp, approval state, cancellation state, and callback tokens. Keep durable source code state in Artifacts/R2, not DO storage.
First-Pass Enablement Gate
The cleanest launch control is thread ownership, not a magic phrase.
Use the thread parent author as the default opt-in signal because it is
structured, auditable, and hard to trigger accidentally. Keep
!!! as an optional allowlisted escape hatch for testing,
but do not require normal users to learn it.
Configuration
Add Worker config for
ANVIL_CF_CONTAINER_RUNNER_ENABLED,
ANVIL_CF_CONTAINER_ALLOWED_THREAD_STARTER_IDS,
ANVIL_CF_CONTAINER_ALLOWED_REQUESTER_IDS, and
optional ANVIL_CF_CONTAINER_OPT_IN_PREFIX. First
deploy sets the global flag true only in the dev Anvil Worker
and allows U0A5KQ7U3N3.
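A minimal parsing sketch for these variables. It assumes ID lists are comma-separated strings and that only the literal value `true` enables the runner; neither format is specified in the plan. The output shape mirrors the config inputs of the gate function.

```typescript
type GateConfig = {
  enabled: boolean;
  allowedThreadStarterIds: Set<string>;
  allowedRequesterIds: Set<string>;
  optInPrefix?: string;
};

// Worker vars arrive as strings; every variable is treated as optional.
type GateEnv = {
  ANVIL_CF_CONTAINER_RUNNER_ENABLED?: string;
  ANVIL_CF_CONTAINER_ALLOWED_THREAD_STARTER_IDS?: string;
  ANVIL_CF_CONTAINER_ALLOWED_REQUESTER_IDS?: string;
  ANVIL_CF_CONTAINER_OPT_IN_PREFIX?: string;
};

// Assumed format: comma-separated Slack user IDs, whitespace tolerated.
function parseIdList(raw: string | undefined): Set<string> {
  const ids = (raw ?? '')
    .split(',')
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
  return new Set(ids);
}

export function parseGateConfig(env: GateEnv): GateConfig {
  const prefix = env.ANVIL_CF_CONTAINER_OPT_IN_PREFIX?.trim();
  return {
    // Fail closed: anything other than "true" leaves the runner disabled.
    enabled: env.ANVIL_CF_CONTAINER_RUNNER_ENABLED?.trim().toLowerCase() === 'true',
    allowedThreadStarterIds: parseIdList(env.ANVIL_CF_CONTAINER_ALLOWED_THREAD_STARTER_IDS),
    allowedRequesterIds: parseIdList(env.ANVIL_CF_CONTAINER_ALLOWED_REQUESTER_IDS),
    ...(prefix ? { optInPrefix: prefix } : {}),
  };
}
```

Failing closed means a missing or misspelled flag keeps every thread on Superset.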
Gate before tool registration
Resolve the runner policy before constructing the Think
harness tool list. If the gate returns Superset, the model
never receives the Cloudflare container tool. If it returns
Cloudflare, the model receives start_cf_code_session,
resume_cf_code_session, browser validation tools,
and the existing Superset fallback tool.
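One way to sketch the catalog selection is a pure function from runner to tool names. `superset_handoff` is a placeholder for the existing Superset tool, whose real name is not given in this plan, and `browser_test` and `capture_screenshot` are assumed to be the browser validation tools meant here.

```typescript
type Runner = 'cloudflare' | 'superset';

// Placeholder name for the existing Superset fallback tool.
const SUPERSET_TOOL = 'superset_handoff';

const CLOUDFLARE_TOOLS = [
  'start_cf_code_session',
  'resume_cf_code_session',
  'browser_test',
  'capture_screenshot',
];

// Return the tool names the model may see for a resolved policy.
// Cloudflare tools are absent, not merely discouraged, on the Superset path.
export function toolCatalogFor(runner: Runner): string[] {
  return runner === 'cloudflare' ? [...CLOUDFLARE_TOOLS, SUPERSET_TOOL] : [SUPERSET_TOOL];
}
```

Because the catalog is derived from the policy before the harness is constructed, the exact tool list for each policy can be asserted in a unit test.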
Recommended rule
Enable Cloudflare when
thread.parent.user in allowedThreadStarterIds.
Optionally enable when the mention text begins with
!!! and the requester is in
allowedRequesterIds. Strip the prefix before
sending the prompt to the coding agent.
Gate Function Sketch
type RunnerPolicy = {
  runner: 'cloudflare' | 'superset';
  reason:
    | 'global-disabled'
    | 'thread-starter-allowlist'
    | 'explicit-prefix'
    | 'not-allowlisted';
  promptText: string;
};
export function resolveAnvilRunnerPolicy(input: {
  enabled: boolean;
  threadParentUserId: string | null;
  requesterUserId: string;
  mentionText: string;
  allowedThreadStarterIds: Set<string>;
  allowedRequesterIds: Set<string>;
  optInPrefix?: string;
}): RunnerPolicy {
  if (!input.enabled) {
    return { runner: 'superset', reason: 'global-disabled', promptText: input.mentionText };
  }
  if (
    input.threadParentUserId &&
    input.allowedThreadStarterIds.has(input.threadParentUserId)
  ) {
    return { runner: 'cloudflare', reason: 'thread-starter-allowlist', promptText: input.mentionText };
  }
  const prefix = input.optInPrefix ?? '!!!';
  const explicit = input.mentionText.trimStart().startsWith(prefix);
  if (explicit && input.allowedRequesterIds.has(input.requesterUserId)) {
    return {
      runner: 'cloudflare',
      reason: 'explicit-prefix',
      promptText: input.mentionText.trimStart().slice(prefix.length).trimStart(),
    };
  }
  return { runner: 'superset', reason: 'not-allowlisted', promptText: input.mentionText };
}
Unit tests should cover global disabled, thread parent allowlist, prefix stripping, non-allowlisted prefix denial, Slack retry idempotency, and the exact tool catalog exposed for each policy.
Browser Testing And Screenshot Capture
Browser validation belongs in the spike because it is one of the main ways the Cloudflare runner can match or exceed the current Superset developer-machine workflow.
Container-local Playwright
Use this for unmerged Forge changes. Start the app inside the task container, create a local dev auth session, run Playwright against localhost, and capture screenshots/traces before teardown.
Cloudflare Browser Run
Use this for public URLs, task preview URLs, smoke checks, PDFs, and lightweight screenshots. Do not make it the only path for unmerged code unless the container exposes a task-scoped preview.
Artifact contract
The runner emits browser/index.json with viewport,
URL, screenshot path, trace path, console summary, network errors,
and pass/fail reason so Slack and PR comments can render evidence.
Default viewports
Start with desktop 1440x1000 and mobile 390x844. Add task-specific viewports only when the agent or user asks. Keep screenshots attached to the final Slack/PR update.
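A possible shape for the browser artifact index and the default viewports. Field names, file names under /artifacts/browser, and the helper functions are illustrative assumptions. The plan fixes the information to record, not the schema.

```typescript
type Viewport = { name: string; width: number; height: number };

export const DEFAULT_VIEWPORTS: Viewport[] = [
  { name: 'desktop', width: 1440, height: 1000 },
  { name: 'mobile', width: 390, height: 844 },
];

// One entry per page/viewport check, as it would appear in browser/index.json.
type BrowserCheck = {
  viewport: Viewport;
  url: string;
  screenshotPath: string;
  tracePath: string;
  consoleSummary: { errors: number; warnings: number };
  networkErrors: string[];
  passed: boolean;
  reason: string;
};

type BrowserIndex = { passed: boolean; checks: BrowserCheck[] };

const BROWSER_ARTIFACT_DIR = '/artifacts/browser';

// Hypothetical file naming: one screenshot and one trace per viewport.
export function artifactPaths(viewport: Viewport): { screenshotPath: string; tracePath: string } {
  return {
    screenshotPath: `${BROWSER_ARTIFACT_DIR}/${viewport.name}.png`,
    tracePath: `${BROWSER_ARTIFACT_DIR}/${viewport.name}-trace.zip`,
  };
}

// The run passes only if at least one check ran and every check passed.
export function buildBrowserIndex(checks: BrowserCheck[]): BrowserIndex {
  return { passed: checks.length > 0 && checks.every((c) => c.passed), checks };
}
```

Treating an empty check list as a failure keeps a skipped browser run from being reported as verified.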
Cloudflare Product Map
Keep the product set small for the spike, but design the interfaces so the fallback pieces can be swapped for Artifacts and Browser Run without changing how the Think gateway talks to the runner.
Anvil ingress, Think gateway, policy checks, Slack/GitHub webhooks, tool proxying, and public status endpoints. (Core)
One actor per Slack thread for serialization, live state, idempotency, callbacks, and container lifecycle. (Core)
Linux execution environment for Codex, repo tools, package managers, tests, and local browser checks. (Core)
Git-compatible, versioned file tree per task. Target substrate for forks, diffs, resume, and Git handoff. (Beta / request access)
Immediate fallback for manifests, session archives, screenshots, traces, logs, tarballs, and cache payloads. (Core)
Admission buffer between Slack/GitHub events and code-session dispatch. Use with explicit container capacity checks. (Admission)
Durable prepare, run, verify, PR, wait-for-human, resume, and cleanup orchestration once the DO API is stable. (Orchestration)
Headless screenshots and browser automation for public/task-preview URLs; optional complement to in-container Playwright. (Browser)
Centralize Slack, GitHub App, Superset, and model credentials; containers only receive scoped tokens. (Security)
No-Context Agent Handoff
A new agent can use this section as the starting brief. The rest of the page explains the architecture; this section says exactly what to implement first and what counts as done.
Start here
Work in the Anvil Worker/Think harness repo, not this Forge workspace, unless the task explicitly asks for Forge runtime changes. This Forge workspace is reference material for the existing runtime contracts: container lifecycle, callback JWTs, R2/session mounts, event journals, and Worker-mediated tools.
First pass scope
Implement only the gate, Slack thread normalization, policy-specific tool catalog, deterministic code-session DO, and a no-op Cloudflare container run that proves callback, logging, and artifact publication. Do not start with PR creation, Artifacts-backed forks, or broad rollout.
Files to find first
Locate the Anvil Slack app mention route, Think harness tool
registry, Superset handoff tool that uses
ANVIL_HANDOFF_URL, Cloudflare Worker env/config,
and existing Slack status/reaction helpers. Add the new gate
beside the tool registry so the CF tool is hidden before the
model is invoked.
Do not lose the fallback
Non-allowlisted threads must behave exactly as they do today: Superset remains the default code-execution tool. The Cloudflare path is additive and gated.
Copy/Paste Implementation Prompt
You are implementing the first pass of the Anvil Cloudflare code runner.
Read the plan at https://anvil-cloudflare-container-plan.pages.dev.
Goal:
- Keep Superset as default.
- Expose the CF container code-session tool only when
  ANVIL_CF_CONTAINER_RUNNER_ENABLED is true and either:
  1. the Slack thread parent user is in ANVIL_CF_CONTAINER_ALLOWED_THREAD_STARTER_IDS
     (initially U0A5KQ7U3N3), or
  2. the mention starts with !!! and the requester is in ANVIL_CF_CONTAINER_ALLOWED_REQUESTER_IDS.
Implement in this order:
1. Find the Anvil Slack mention route, Think harness tool registry, and Superset handoff tool.
2. Add a pure resolveAnvilRunnerPolicy() helper with unit tests.
3. Fetch/normalize full Slack thread context before planning.
4. Register tool catalogs by policy: Superset-only vs CF+browser tools+Superset fallback.
5. Add AnvilCodeSession Durable Object keyed by slack:{team}:{channel}:{thread_ts}.
6. Add start_cf_code_session no-op run: create/wake DO, start container, stream logs, emit artifact manifest.
7. Post Slack status with selected runner policy and verification artifact link.
Definition of done:
- Non-allowlisted thread never exposes CF tools.
- U0A5KQ7U3N3-started thread exposes CF tools.
- Optional !!! prefix works only for allowlisted requester and is stripped from the agent prompt.
- Slack retries are idempotent by {channel}:{mention_ts}.
- No-op CF run posts terminal Slack result and artifact/log link.
- Tests cover gate, thread normalization, tool catalog selection, idempotency, and DO resume.
- Superset fallback remains available and unchanged.
First-pass tests
Add unit tests for gate decisions, prefix stripping, non-allowed prefix denial, tool catalog selection, thread session key generation, Slack retry idempotency, and DO resume/start branching.
Manual smoke
In a dev Slack channel, test one non-allowlisted thread, one
thread started by U0A5KQ7U3N3, and one
!!!-prefixed allowlisted mention. Confirm the
selected runner is logged and posted to Slack.
First-pass non-goals
Do not implement PR creation, Artifacts-backed base snapshots, Browser Run, channel-wide rollout, or cross-repo support until the no-op container run is stable.
Handoff caveat
This Forge workspace does not contain the Anvil Slack gateway. If the agent only has this repo, first locate or request the Anvil Worker source before implementing the ingress and tool registry.
Actual Implementation Work Plan
The Anvil Slack ingress/gateway is not checked into this Forge workspace. The implementation should add a small Anvil-owned gateway layer, then reuse the existing Forge worker runtime contracts where they already solve container lifecycle, callback auth, R2 mounts, event journals, and Worker-mediated tools.
Add the Cloudflare runner gate config.
Add typed Anvil Worker config for the global enable flag,
allowed thread starter IDs, allowed requester IDs, and optional
opt-in prefix. Put parsing in a pure
code-runner/gate.ts module with tests. Default
production behavior stays Superset unless the flag and
allowlist pass.
Fetch and normalize the full Slack thread before planning.
In the Slack event handler, call
conversations.replies, identify the parent message
author, mention timestamp, requester, channel, team, and text,
and compute
slack:{team_id}:{channel_id}:{thread_ts}. Use
{channel_id}:{mention_ts} as the event idempotency
key.
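A sketch of the normalization step, operating on the `messages` array returned by conversations.replies. The `SlackMessage` type is a simplified subset of Slack's message object, and `ThreadPacket` and `normalizeThread` are hypothetical names.

```typescript
// Simplified subset of a Slack message; bot messages may lack `user`.
type SlackMessage = { ts: string; thread_ts?: string; user?: string; text?: string };

type ThreadPacket = {
  sessionKey: string;
  idempotencyKey: string;
  threadParentUserId: string | null;
  requesterUserId: string;
  mentionText: string;
};

export function normalizeThread(input: {
  teamId: string;
  channelId: string;
  mentionTs: string;
  messages: SlackMessage[];
}): ThreadPacket {
  const mention = input.messages.find((m) => m.ts === input.mentionTs);
  if (!mention || !mention.user) {
    throw new Error(`mention ${input.mentionTs} not found in thread or has no user`);
  }
  // A mention in the root message has no thread_ts; its own ts names the thread.
  const threadTs = mention.thread_ts ?? mention.ts;
  const parent = input.messages.find((m) => m.ts === threadTs);
  return {
    sessionKey: `slack:${input.teamId}:${input.channelId}:${threadTs}`,
    idempotencyKey: `${input.channelId}:${input.mentionTs}`,
    threadParentUserId: parent?.user ?? null,
    requesterUserId: mention.user,
    mentionText: mention.text ?? '',
  };
}
```

The packet carries exactly the fields the gate needs (parent author, requester, mention text), so policy resolution can stay a pure function.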
Register different tool catalogs by runner policy.
If the policy is Superset, expose only the existing Superset
handoff/workspace tool. If the policy is Cloudflare, expose
start_cf_code_session,
resume_cf_code_session,
browser_test, capture_screenshot, and
the Superset fallback. Log the selected policy and reason into
the Slack progress message and event journal.
Create an Anvil code-session Durable Object.
Add an Anvil-owned AnvilCodeSession DO keyed by the
Slack thread session key. It stores run summaries, active
container identity, current branch/worktree, artifact manifest,
last Slack status timestamp, cancellation state, and retry
status. It does not store the repo itself.
Bridge the session DO to the existing runtime stack.
Reuse the patterns from
containers/worker/src/runtime-thread-agent.ts,
containers/worker/src/agent-runtime/runtime-process-controller.ts,
and containers/worker/src/agent-runtime/README.md:
Worker-owned secrets, callback JWT, R2/session mounts,
event-journal projection, and runtime-server setup. Add an
Anvil run builder that produces an
AgentRuntimeStartRequest for repo coding work
instead of overloading Forge chat thread resolution.
Implement GitHub clone first, then Artifacts base snapshots.
First pass clones Forge with a short-lived GitHub App token and archives the resulting workspace/diff to R2. Next, add the post-dev-deploy GitHub Actions refresh of the canonical base repo in Cloudflare Artifacts and fork/copy that base per Slack thread session.
Wire command and browser verification into the final report.
The container emits a verification manifest with command results, changed files, screenshots, Playwright traces, console errors, and known skipped checks. Slack and PR updates link to signed artifact URLs and clearly mark any unverified work.
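A minimal sketch of turning a verification manifest into report lines that flag unverified work. The manifest fields and the line format are assumptions for illustration only.

```typescript
type CommandResult = { command: string; exitCode: number };

// Hypothetical subset of the verification manifest.
type VerificationManifest = {
  commands: CommandResult[];
  changedFiles: string[];
  screenshots: string[];
  skippedChecks: string[];
};

// Produce plain-text lines suitable for a Slack or PR update.
export function summarizeVerification(m: VerificationManifest): string[] {
  const failed = m.commands.filter((c) => c.exitCode !== 0).length;
  const passed = m.commands.length - failed;
  // Verified means: something ran, nothing failed, and nothing was skipped.
  const verified = m.commands.length > 0 && failed === 0 && m.skippedChecks.length === 0;
  const lines = [
    verified ? 'Status: verified' : 'Status: UNVERIFIED',
    `Commands: ${passed} passed, ${failed} failed`,
    `Changed files: ${m.changedFiles.length}`,
    `Screenshots: ${m.screenshots.length}`,
  ];
  for (const check of m.skippedChecks) {
    lines.push(`Skipped: ${check}`);
  }
  return lines;
}
```

Listing skipped checks explicitly keeps a partially verified run from reading as a clean pass.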
Expand by config, not code changes.
After the requester-only gate is stable, add channel allowlists, repo allowlists, and a low-volume percentage gate. Keep the global kill switch and Superset fallback throughout rollout.
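A low-volume percentage gate can be made deterministic by hashing the thread session key, so a thread never flips runners between mentions. The sketch below uses a 32-bit FNV-1a hash over UTF-16 code units, which matches the byte-wise hash for ASCII keys. The hash choice and function names are assumptions, not part of the plan.

```typescript
// 32-bit FNV-1a over UTF-16 code units (identical to byte-wise FNV-1a for ASCII keys).
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// True when the thread falls inside the rollout percentage (0 to 100).
export function inPercentageRollout(sessionKey: string, percent: number): boolean {
  if (!Number.isFinite(percent) || percent < 0 || percent > 100) {
    throw new RangeError(`percent must be between 0 and 100, got ${percent}`);
  }
  const bucket = fnv1a32(sessionKey) % 100; // stable bucket in [0, 99]
  return bucket < percent;
}
```

Setting the percentage to 0 disables the gate entirely and 100 admits every thread, so the value can be changed through config without a code change.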
Implementation Spike
The spike is complete when Anvil can take a Slack thread request, produce a verified code diff in Cloudflare, and post evidence back to the same thread without creating a Superset workspace.
Thread gateway and requester-only runner policy
Add runner = cloudflare | superset, deterministic
thread DO routing, Slack thread fetch before planning, and
idempotency by mention timestamp. First deploy enables
Cloudflare only when the thread parent author is
U0A5KQ7U3N3, with optional allowlisted
!!! prefix override for targeted tests.
Container image and no-op run
Build the initial image, start it from the DO, stream logs, prove
callbacks, run node --version,
pnpm --version, and archive a tiny artifact bundle.
Forge clone and Codex diff
Use GitHub clone plus R2 archive first. Run Codex on a safe branch-scoped change, capture the diff, and post a Slack update with the changed files and verification status.
Browser validation lane
Start Forge in the container with local agent auth, run one Playwright smoke test, capture desktop and mobile screenshots, and publish signed links in the final Slack reply.
PR-capable MVP
Push a branch and open a PR through Worker-mediated GitHub tools. Keep Superset as an explicit tool for Mac-only, GUI-heavy, Docker-heavy, or over-budget sessions.
Artifact-backed base refresh
Add the post-dev-deploy GitHub Actions base refresh. Switch task worktrees from GitHub clone to Artifacts fork once account access and performance are validated.
Acceptance Criteria
Thread resume works.
A second mention in the same Slack thread reaches the same DO, sees prior run state, and either resumes or starts a new run with the existing branch context.
Verification evidence is visible.
Final Slack and PR updates include command results, screenshots when browser validation ran, and links to logs, traces, and artifact manifests.
Secrets stay outside the worktree.
Containers can call back to Worker-owned tools, but cannot read long-lived Slack, GitHub App, model, database, or Superset credentials.
Known Risks
Artifacts access and performance are the main unknowns.
Treat Artifacts as the target substrate, not the first hard dependency. The spike should ship with GitHub clone plus R2 archive, then swap in Artifacts when access, repo import/fork, token flows, and clone/mount speed are validated.
Cold starts can erase the win.
Browser binaries, package managers, and OS dependencies must be in the image. Measure clone/install time before adding complexity; add Artifacts/ArtifactFS and R2 caches based on data.
Not every task should leave Superset.
Keep the fallback for Mac-only tools, deep GUI handoff, long-running interactive debugging, Docker-in-Docker, and sessions that exceed Cloudflare container limits.
Source Notes
- Cloudflare Containers: build and push container images, call containers from Workers, and use Durable Object-backed lifecycle control.
- Cloudflare Artifacts: Git-compatible, versioned file trees with Workers, REST, and Git client access; repo-scoped tokens make it a strong fit for per-task agent worktrees.
- Cloudflare Browser Run: headless browser automation, screenshots, PDFs, and Playwright control from Workers.
- Cloudflare GitHub Actions and Git integration: use post-deploy automation to refresh the canonical Forge base after dev deploys succeed.
- Forge local facts checked in this workspace: root and Forge package engines require Node 22.x, root package manager is pnpm 10.28.1, and the current agent-runtime Dockerfile uses a DB-free runtime image with Worker-owned persistence and callbacks.
- Runtime implementation references checked in this workspace: containers/worker/src/runtime-thread-agent.ts, containers/worker/src/agent-runtime/runtime-process-controller.ts, containers/worker/src/routes/runtime-thread-agents.ts, containers/worker/src/agent-runtime/README.md, and packages/common/src/thread-agent-auth.ts. The existing runtime stack supports thread-scoped agents, AgentRuntimeSandbox Durable Objects, R2/session mounts, callback JWTs, and Worker-owned tool proxying, but the Anvil Slack ingress/gateway itself is not checked into this Forge workspace.
- First-pass rollout recommendation: choose thread-parent user allowlisting as the primary gate because Slack provides that as structured metadata. Keep the !!! prefix as an optional allowlisted override for targeted tests, not as the normal rollout control.