Testing: a test harness for products where the model is part of the system

This is Team’s blueprint for testing a product that mixes ordinary software with non-deterministic components, most often a large language model. The same shape applies to anything whose output you can only judge by quality (a recommender, a search ranker, a generative pipeline, a heuristic planner).

Team’s harness is TypeScript on Bun (bun test). The claude CLI drives the non-deterministic component under test. Diff-based selection compares against origin/main. Those are Team’s concrete choices. Nothing in the design below assumes a particular framework, language, or model vendor.

1. The core problem

The classic Test Pyramid sorts tests along one axis: integration scope (unit → integration → end-to-end). It assumes every test is deterministic: the same input always yields the same pass/fail.

That assumption breaks the moment part of your system is a model. The “function under test” is now a probability distribution. Run the same prompt twice and you may get different words, a different number of findings, a different tool call. You cannot assertEqual your way through it. You also cannot block a merge on a check that is red 8% of the time for no reason.

So this harness keeps the pyramid’s scope axis and adds a second axis: determinism and cost. Every test lives at a coordinate of (scope, determinism). The whole design is one discipline: push each check as far down and as far toward “deterministic” as it will go. The cheapest, most reliable place to catch a given bug is almost never an expensive model eval.

  determinism →  HIGH (deterministic, free, fast)        LOW (stochastic, paid, slow)
  scope ↓
  unit          pure-function tests                       —
  integration   in-process component tests                —
  e2e           real-dependency tests (browser/daemon)    live-model behavioral tests
  quality       static contract "tripwires"               model-as-judge evals

The two axes produce a six-layer harness. The bottom four are the classic pyramid (free, deterministic). The top two are the part most test suites lack: graded checks on stochastic behavior.

2. The six layers

                          ▲ $ cost        ▲ non-determinism
   ┌───────────────────────────────────────────────────────────┐
   │  L6  QUALITY EVALS — model-as-judge                       │ $$$ paid
   │      "is the output good?" graded 1–N, pass = floor/band  │ stochastic
   ├───────────────────────────────────────────────────────────┤
   │  L5  BEHAVIORAL E2E — drive the live model                │ $$ paid
   │      "does it DO the right thing?" deterministic guards   │ gate + periodic
   ╞══════════════════ PAID  /  FREE  LINE ════════════════════╡
   │  L4  REAL-DEPENDENCY E2E — real browser/daemon/process    │ $ slow, free
   ├───────────────────────────────────────────────────────────┤
   │  L3  INTEGRATION — real components, no external surface   │ fast, free
   ├───────────────────────────────────────────────────────────┤
   │  L2  STATIC-INVARIANT TRIPWIRES — assert on the source    │ <100ms, free
   │      grep the code/config; fail when a contract is broken │ ← the multiplier
   ├───────────────────────────────────────────────────────────┤
   │  L1  PURE UNIT — pure functions, no I/O                   │ <50ms, free
   │      the wide base                                        │
   └───────────────────────────────────────────────────────────┘

Outside these six layers sits the Golden Master: a manual, out-of-band run of the whole /team pipeline. It runs against a frozen external app and a frozen prompt. It tracks pipeline drift and compares models. It is deliberately not an L1-L6 layer, not in bun test, and not in CI. A run from inside this repo would let Team’s own context poison a test that must mirror a real outside user. See golden-master/.

L1: Pure unit (the wide base)

Pure functions: parsing, formatting, validation, version math, data transforms. No I/O, no network, no model. Hermetic: if a test needs a temp dir, make one keyed by pid+timestamp and delete it after. This is where most of your test count lives and where each test costs microseconds.

Rule of thumb: if a behavior can be expressed as f(input) -> output, it belongs here, and almost everything that can be refactored into that shape should be, precisely so it can be tested here.

L2: Static-invariant “tripwires” (the highest-leverage layer)

These tests do not run behavior. They read your own source or config and assert a contract with a pattern match. They are executable architecture documentation.

Every load-bearing rule in your codebase gets a test that fails the build when someone breaks it, in milliseconds, and without execution of anything. Example rules are “never import X from Y” and “all writes must route through this helper”. Two more are “this name must not collide with a reserved word” and “an import of this module must have no side effects”.

Canonical forms:

Forbidden-pattern tripwire. Fail if a banned call reappears. Use grep for the dangerous API. Assert zero matches outside an allowlist.
Required-routing tripwire. Fail if a sensitive operation does not go through its safe wrapper.
No-side-effect tripwire. Spawn a subprocess, import the module, assert it bound no port / wrote no file / registered no handler.
Ordering tripwire. Parse a file, assert call A precedes call B (e.g. a scan must run before the thing it protects).
Collision / drift tripwire. Assert generated artifacts are fresh, names do not shadow reserved ones, and enums in two files stay in sync.

Discipline: when you write a comment that says “NEVER do X here,” write the tripwire in the same change. A constraint without a test is a suggestion.

A tripwire asserts a contract, never a wording

Much of this repo is prose that a model reads: skill bodies, agent prompts, docs. That makes it tempting to assert on the words. Do not. A tripwire must break only when the contract breaks.

Assert these. They are what a machine reads, so a change to one is a real change:

Frontmatter keys and values (name:, model:, tools:, effort:).
Commands, flags, GraphQL fields, shell patterns, and template strings the prompt tells the model to emit.
File paths and cross-references. A renamed target must fail the build.
Section headings, and the order of two sections.
Numeric constants that bound behavior (sleep 600, timeout 1800).
The absence of a forbidden identifier or a forbidden claim. A negative sweep never breaks under a rewrite, because a rewrite does not add back the thing you banned.

Never assert these. They break on every honest edit and hold the prose still:

That a sentence uses particular words. A regex such as /never\s+approve\s+unconfirmed/ pins one phrasing of a rule, not the rule.
That two words appear near each other. Proximity spans ([^.]{0,240}, .{0,250}) test if an author put two ideas in one sentence. That is a style question, not a contract.
How long a file is. A line ceiling measures nothing about correctness, and it turns every later edit into a budget negotiation.

The test is simple: if a rewrite that preserves the meaning turns the test red, the test was measuring the wording. Delete it, or find the identifier underneath it and assert that instead. A wording pin gives false confidence in both directions. It fails when nothing broke, and it passes when an author keeps the sentence but guts the logic around it.

Behavior that only prose can carry belongs at L5 or L6, where a model judges if the instruction still lands. Do not simulate that at L2 with a regex.

This layer is what lets a stochastic product stay correct cheaply: a huge fraction of “regressions” are really contract violations. Contracts stay deterministic even when behavior does not.

L3: In-process integration

Real components wired together, but no external surface. Stand up an in-memory server fixture, call the real request handler, exercise the real dispatch path. Subprocess-snapshot tests (spawn the program, capture stdout / files written / exit code) live here too. Fast and free because nothing leaves the box.

L4: Real-dependency E2E (slow, still free)

Launch the real heavy dependency (a real browser, a real daemon, a real child process) and drive it. Reserve this for cases where the integration is the thing under test (lifecycle, concurrency, file locking, signal handling, real rendering). Mock the world, never the subject. Use env knobs to accelerate timers so a 5-minute watchdog test runs in 5 seconds.

These cost wall-clock time but no money, so they can still run on every PR.

L5: Behavioral E2E against the live model (the paid frontier)

Here the unit under test is a prompt or a policy, and you must call the real model to test it. Split this layer by determinism:

Deterministic guardrails. The output must contain, or must never contain, a specific thing. “Given a planted vulnerability, the review must flag it.” “Given a read-only mode, the agent must never call the write tool.” These pass or fail cleanly and can gate merges.
Quality benchmarks. The same behavior on an expensive model, or where the bar is subjective. These go to a non-blocking tier (see §4).

Pick the runner by what it can observe, not by convenience. A model can be driven three common ways, each with a different observation surface:

Runner style	Mechanism	Observes	Use when
Stream runner	invoke `claude`, parse streamed events	tool calls, tokens, cost	most behavioral E2E. It is stateless and repeatable.
PTY runner	drive a real terminal	the rendered UI: interactive prompts, cursor, and redraws	testing UI that the event stream cannot see
SDK runner	call the model SDK directly	raw API events	A/B-ing system prompts / config with low overhead

Three runners exist because some behavior only exists on one surface, not for redundancy. An interactive confirmation rendered to a TTY does not appear in a JSON event stream. Test the surface the user actually experiences.

L6: Quality evals (model-as-judge)

When correctness is genuinely subjective (prose quality, “are these findings real,” “is this the right recommendation”), use a second model to grade the first. Keep it cheap and honest:

Use a cheaper model as the judge than the one being judged where you can.
Cascade. Gate the expensive judge behind cheap deterministic checks. Only pay for a model call when a regex or heuristic shows the need.
Score on a rubric (1 to N). Pass on a floor or a band, never on exact match. One example is “detection rate ≥ ground-truth floor and false positives ≤ ceiling”. Another is “count within ±2 of the seeded number”. Bands are how you absorb non-determinism without flakiness.
Persist scores across runs so you can see quality drift, not only today’s pass/fail.

3. Diff-based selection: only run what the change could have broken

A paid suite cannot run every test on every change. Make selection automatic:

Each expensive test declares its source dependencies (“touchfiles”): the files and globs that, if changed, could alter its outcome.
The runner diffs the branch against origin/main and runs only the tests whose dependencies changed.
A small set of global touchfiles triggers everything when touched: the shared runner, the result store, and the selection map itself. If a change to the selector did not re-run everything, a one-line edit could deselect a test in silence. Close that hole.
An escape hatch (RUN_ALL=1 or equivalent) forces the full suite for releases and scheduled runs.

This is the difference between a paid suite that costs a few dollars per change and one nobody can afford to run.

4. Two tiers: determinism decides the gate, cost decides the cadence

Classify every paid/behavioral test as gate or periodic:

Gate. A deterministic guardrail or cheap functional check. Blocks merge. Runs on every PR (after diff-selection).
Periodic. A quality benchmark, expensive-model run, externally-dependent, or otherwise non-deterministic check. Never blocks merge. Runs on a schedule (e.g. weekly cron) or on demand, with selection forced to “all.”

The classification rule is a single question:

Can this test be red for a legitimate, non-bug reason (model variance, a flaky external service, a subjective threshold)? Yes → periodic. No → gate.

You can never gate CI on a check that is sometimes red for no reason. That trains everyone to ignore red. Gate on what is stable. Monitor what is fuzzy.

Two-axis matrix in practice:

	Deterministic	Non-deterministic
Cheap	gate, every PR	periodic (or band-gated)
Expensive	gate if rare, else periodic	periodic, scheduled

Where Team stands today: every eval fixture drives a live model. All of them are thus tiered periodic, and none can gate. The gate slot stays empty on purpose. A gate-tier check must both need paid infrastructure and be deterministic, which the matrix above makes rare. The deterministic scoring signal lives in the free suite instead: tests/code-reviewer-replay.test.ts replays the code-reviewer eval offline at L3, on every PR for every author, including forks.

5. Free vs. paid: the line that organizes everything

Draw one hard line between free (no model, no metered API, no money) and paid, and wire your tooling around it:

A default command (bun test) runs only the free tiers (L1-L4). Sub-second to low-tens-of-seconds. Run it before every commit. It must be cheap enough that not running it is never tempting.
A separate command runs the paid tiers (L5-L6), diff-selected, before shipping and in CI.
CI runs the gate subset of paid tests and blocks merge. A scheduled job runs the periodic subset and reports.

Keep the free suite genuinely free. One stray metered call buried in a “unit” test poisons the whole base. A tripwire that greps tests for forbidden paid calls is a good L2 guard for exactly this. It greps for the model CLI or the SDK import in the free test roots.

Token-consuming CI jobs are additionally restricted to trusted PR authors (OWNER, MEMBER, or COLLABORATOR). Untrusted PRs skip them: forks, Dependabot, and first-time contributors. This is a security control against token-spend griefing, not only a cost optimization. It is also forward-proofing. The periodic-evals.yml gate stays dormant until someone adds a pull_request trigger. harness-checks.yml carries the trust expression as a documented-only contract for the same reason. The control is thus already in place the moment a paid step attaches.

6. Meta-tests: test the test system

Once the harness has structure, protect the structure itself:

Coverage audit. Every component of a given class (e.g. every interactive surface) must have at least one test of the right kind. Fail CI if a new one ships without one.
Selection integrity. The diff-selection map parses, has no dangling entries, and every declared test exists.
Shard/parallel integrity. If you shard the free suite across CI workers, test that the sharding is stable and total. Every file must land in exactly one shard.
Portability detection. If you support several operating systems, scan tests for platform-fragile patterns: a hardcoded /tmp, POSIX-only shells, and path separators. Then curate a known-safe subset for the other platform’s CI.

Meta-tests are themselves L2 tripwires pointed at your own test directory.

7. Cost is a design constraint, engineered down

Treat money like latency: a budget you architect against, not an afterthought:

Diff-based selection (§3): do not run what could not have changed.
Tiering (§4): expensive runs go off the critical path.
Judge cascades (§2, L6): pay for a model only after cheap checks justify it.
Cheaper judge than generator: grade with a smaller model where viable.
Fixture replay: for an expensive deterministic check, record the costly call once into a committed fixture and replay it. Re-record it only when the inputs that define it change. Guard that with a tripwire on the input files.
Persist and compare: store every paid run’s cost and scores so regressions in spend are as visible as regressions in quality.

The goal is a paid suite that costs a few dollars per change set, not hundreds.

8. Cross-cutting principles

Hermetic by construction, real where it counts. Fakes and stubs for determinism (a fake CLI on PATH, a stub upstream server, scripted child processes). Real heavy dependencies only when the integration is the subject.
Refactor toward testability at the lowest layer. Every pure function you extract is a behavior you can pin in L1 instead of pay for in L5.
A constraint that matters gets a tripwire. L2 is where architecture decisions become enforceable. Write the tripwire with the rule.
Pin every fixed bug with a regression test named for it. One file per bug, named regression-<id>-<slug>, asserting the specific fix. They cost nothing and never let the same bug back in.
Floors and bands, not exact match, for anything stochastic. Determinism where you can engineer it. Tolerance where you cannot.
Observation surface dictates the tool. Test what the user actually sees.
No “pre-existing failure” without proof. Before blaming a flake on something other than your change, reproduce it on the base branch. Stochastic systems have invisible couplings. “Not my change” is a claim that needs a receipt.
Do not chase a coverage or lint number. Accept a “worse” pattern when it is the correct engineering choice. Fix patterns that are genuinely bugs. The number is a smell test, not a target.

9. Adopting this in a new project

A pragmatic order of operations:

Draw the free/paid line. Make bun test run only free tiers. Populate L1 (pure unit) first. It is the cheapest coverage you will ever buy.
Add tripwires (L2) for your top 10 “NEVER do X” rules. Highest leverage per line of test you will write. Each is a few minutes and prevents a class of regression forever.
Stand up in-process integration (L3) with a fixture server/handler so you can test real dispatch without the network.
Add real-dependency E2E (L4) only for the lifecycle/concurrency cases that demand it. Accelerate timers through env knobs.
If you have a model in the loop, build one behavioral runner (L5) for the surface you care about most. Write deterministic guardrails first, because they are the ones you can gate on.
Add diff-based selection and the gate/periodic split as soon as the paid suite costs enough to notice. Wire CI to run gate-on-PR, periodic-on-schedule.
Add model-as-judge evals (L6) last, with cascades and band-based passing, for the genuinely subjective quality you cannot pin any other way.
Add meta-tests once the structure is worth protecting.

You do not need all six layers on day one. You need the free/paid line, a wide L1 base, and the tripwire habit. The upper layers earn their place as the stochastic surface of your product grows.

10. One-page summary

Two axes, not one: scope and determinism/cost. Push every check down and toward deterministic.
Six layers: pure unit → tripwires → integration → real-dependency E2E → live-model behavioral → model-as-judge.
Tripwires (L2) are the multiplier: assert contracts on the source, fail in milliseconds, no execution. Write one with every “NEVER” rule.
Determinism decides the gate. Cost decides the cadence. Only deterministic checks block merges. Stochastic checks run on a schedule and pass on bands.
Diff-based selection runs only what a change could break. Global touchfiles, including the selector itself, re-run everything.
Cost is architecture: selection, tiering, judge cascades, cheaper judges, fixture replay, persisted spend.
Pick the runner by observation surface. Test what the user sees.
Hermetic where you can, real where the integration is the point.

Appendix A: Diagrams

A.1 The two-axis layer model

How the six layers sit against scope (vertical) and determinism/cost (horizontal), and where the free/paid line falls.

flowchart TB
    subgraph FREE["FREE · deterministic · runs on every commit"]
        direction TB
        L1["L1 · Pure unit<br/><i>f(input) → output, no I/O</i>"]
        L2["L2 · Static tripwires<br/><i>assert contracts on source</i>"]
        L3["L3 · In-process integration<br/><i>real components, no external surface</i>"]
        L4["L4 · Real-dependency E2E<br/><i>real browser/daemon/process</i>"]
        L1 --> L2 --> L3 --> L4
    end
    subgraph PAID["PAID · stochastic · diff-selected"]
        direction TB
        L5["L5 · Behavioral E2E<br/><i>drive the live model</i>"]
        L6["L6 · Model-as-judge evals<br/><i>graded 1–N, pass on floor/band</i>"]
        L5 --> L6
    end
    FREE ==>|"free/paid line"| PAID

    L5 --> G{"Can it be red for a<br/>non-bug reason?"}
    G -->|"No → deterministic"| GATE["GATE tier<br/>blocks merge · every PR"]
    G -->|"Yes → stochastic"| PER["PERIODIC tier<br/>scheduled · never blocks"]
    L6 --> PER

A.2 Pick the runner by observation surface

flowchart LR
    Q{"What must the<br/>test observe?"}
    Q -->|"tool calls, tokens, cost"| S["Stream runner<br/>invoke model CLI,<br/>parse streamed events"]
    Q -->|"rendered UI:<br/>prompts, cursor, redraws"| P["PTY runner<br/>drive a real terminal"]
    Q -->|"raw API events,<br/>low overhead A/B"| K["SDK runner<br/>call the model SDK"]

A.3 CI execution flow

What happens on a pull request vs. on the schedule.

flowchart TD
    PR["Pull request opened / updated"] --> FREEJOB["Free suite<br/>L1–L4 · sharded · no secrets"]
    PR --> LINT["Lint + meta-tests<br/>tripwires, selection integrity"]
    PR --> SEL["Diff vs. base branch<br/>select touched paid tests"]
    SEL --> GLOBAL{"Global touchfile<br/>changed?"}
    GLOBAL -->|yes| ALL["select ALL paid tests"]
    GLOBAL -->|no| SUBSET["select touched subset"]
    ALL --> GATEJOB
    SUBSET --> GATEJOB["GATE paid tests<br/>TIER=gate · secrets injected"]
    FREEJOB --> VERDICT{"all required<br/>checks green?"}
    LINT --> VERDICT
    GATEJOB --> VERDICT
    VERDICT -->|yes| MERGE["mergeable"]
    VERDICT -->|no| BLOCK["blocked"]

    CRON["Schedule · e.g. weekly"] --> PERJOB["PERIODIC paid tests<br/>TIER=periodic · RUN_ALL=1"]
    PERJOB --> REPORT["report cost + scores<br/>compare vs. history · non-blocking"]

Appendix B: CI skeleton

Drop-in scaffolding. It is deliberately vendor-neutral pseudo-config. Adapt the runner, CI platform, and secret names to your stack. Bracketed tokens are placeholders.

B.1 Test scripts (`package.json` example, or a `Makefile`/`justfile` equivalent)

{
  "scripts": {
    // FREE — runs on every commit. No model, no metered API, no money.
    // Must stay cheap enough that skipping it is never tempting.
    "test": "[test runner] test <free-test-roots> --exclude '<paid-glob>'",

    // FREE, sharded — for parallel CI workers (stable, total partition).
    "test:shard": "[test runner] run scripts/free-shards.[ext]",

    // PAID — diff-selected against the base branch. Run before shipping.
    "test:paid": "PAID=1 [test runner] test <paid-glob>",

    // PAID, gate subset — deterministic guardrails. What CI blocks on.
    "test:gate": "PAID=1 TIER=gate [test runner] test <paid-glob>",

    // PAID, periodic subset — quality/expensive/non-deterministic.
    "test:periodic": "PAID=1 TIER=periodic RUN_ALL=1 [test runner] test <paid-glob>",

    // PAID, ignore diff selection — for releases.
    "test:paid:all": "PAID=1 RUN_ALL=1 [test runner] test <paid-glob>",

    // Show which paid tests the current diff would select (dry run).
    "test:select": "[test runner] run scripts/select.[ext]"
  }
}

B.2 Diff-based selection (sketch)

The data structure every paid test reads to decide if it runs.

// scripts/touchfiles.ts (or .py/.go — same shape in any language)

// Each paid test name → the source globs that, if changed, could alter it.
export const TOUCHFILES: Record<string, string[]> = {
  "feature-a-guardrail": ["src/feature-a/**", "src/shared/dispatch.*"],
  "feature-b-quality":   ["src/feature-b/**", "prompts/feature-b.*"],
};

// Tier of each paid test. Deterministic → gate; stochastic/expensive → periodic.
export const TIERS: Record<string, "gate" | "periodic"> = {
  "feature-a-guardrail": "gate",
  "feature-b-quality":   "periodic",
};

// Changing any of these re-runs EVERYTHING — shared infra + the selector
// itself, so you cannot deselect a test by editing one line.
export const GLOBAL_TOUCHFILES = [
  "test/helpers/run.*",       // the shared runner
  "test/helpers/store.*",     // the result store
  "scripts/touchfiles.*",     // this file
];

export function selectTests(changed: string[], tier?: "gate" | "periodic") {
  const global = changed.some(f => GLOBAL_TOUCHFILES.some(g => match(f, g)));
  return Object.keys(TOUCHFILES).filter(name => {
    if (tier && TIERS[name] !== tier) return false;
    if (process.env.RUN_ALL === "1" || global) return true;
    return changed.some(f => TOUCHFILES[name].some(g => match(f, g)));
  });
}

A test file then guards itself:

const selected = process.env.RUN_ALL === "1"
  || selectTests(diffAgainstBase(), process.env.TIER as any).includes("feature-a-guardrail");

describe.if(selected)("feature-a guardrail", () => { /* ... */ });

B.3 Pull-request workflow (gate)

Runs on every PR. Blocks merge. The YAML below is generic GitHub-Actions style. Map it to your CI of choice.

name: ci
on:
  pull_request:
    branches: ["[base branch]"]

jobs:
  free-tests:                      # L1–L4 · fast · no secrets
    runs-on: [runner]
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # full history for diff selection
      - run: "[install]"
      - run: "[test runner] test"  # the free suite

  lint-and-meta:                   # tripwires + selection integrity
    runs-on: [runner]
    steps:
      - uses: actions/checkout@v4
      - run: "[install]"
      - run: "[lint command]"
      - run: "[test runner] test test/meta"   # coverage audit, selection map, shards

  gate-evals:                      # L5 gate subset · diff-selected · SECRETS
    runs-on: [runner]
    strategy:
      matrix:                      # parallelize by suite to cut wall-clock
        suite: [feature-a, feature-b, feature-c]
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - run: "[install]"
      - run: "[build]"
      - name: Run gate tests for $
        env:
          PAID: "1"
          TIER: "gate"
          MODEL_API_KEY: $
        run: "[test runner] test test/e2e-$.* --retry 2"

  # Branch protection: require free-tests + lint-and-meta + gate-evals to pass.

B.4 Scheduled workflow (periodic)

Runs the stochastic/expensive tier off the critical path. Never blocks merges.

name: evals-periodic
on:
  schedule: [{ cron: "0 6 * * 1" }]   # weekly, Monday 06:00 UTC
  workflow_dispatch: {}               # allow manual runs

jobs:
  periodic-evals:
    runs-on: [runner]
    strategy:
      matrix:
        suite: [feature-b, feature-c, external-service]
    steps:
      - uses: actions/checkout@v4
      - run: "[install]"
      - run: "[build]"
      - name: Periodic eval $
        env:
          PAID: "1"
          TIER: "periodic"
          RUN_ALL: "1"              # ignore diff: run the whole tier
          MODEL_API_KEY: $
        run: "[test runner] test test/e2e-$.* --retry 2"
      - name: Persist + compare cost/scores
        run: "[test runner] run scripts/eval-report.[ext]"   # store, diff vs. history

B.5 Branch-protection checklist

free-tests is necessary: fast, deterministic, no secrets.
lint-and-meta is necessary: tripwires and selection integrity.
gate-evals is necessary: deterministic model guardrails only.
periodic-evals is not necessary: scheduled, reports, never blocks.
Secrets exposed only to gate/periodic jobs, never to the free suite.
Fork PRs: decide deliberately if they receive model secrets. The default is no. Run free and meta only, or re-target the branch to a trusted remote.