A real test of whether green CI can miss release operation behavior changes

Passing CI did not prove the release operation still behaved the same.

We tested DriftFence on four fixed release-it coding tasks with GPT-5.4. release-it is an open-source release automation tool for versioning, tagging, and publishing npm packages. Each task stands in for one concrete release operation inside release-it. In the two publish-behavior operations, DriftFence would have blocked 5 of 12 model-written code changes before merge even though the relevant tests still passed. Independent review later judged 4 of those 5 blocked patches worth review or rejection. In the two other operations we tested in the same experiment, DriftFence stayed quiet on all 11 test-passing patches.

Experiment setup
  • Model: GPT-5.4
  • Unit: one model run and its patch
  • Order: relevant tests ran before DriftFence
  • Review: blocked patches got independent follow-up review
Scope note: this evidence supports a specific pre-merge release-it operation use case. It does not yet show broad efficacy across all repos or all operations. It shows what happened when the same model was run from the same starting state on four fixed tasks.
Method note: the model prompt did not mention DriftFence or include DriftFence artifacts.
Model-written patches measured: 23

Across four fixed release-it tasks, all in scope and all test-passing.

Publish-behavior operations: 5 of 12 blocked (41.7%)

Across custom npm registry handling and private-package publish rules, DriftFence would have blocked the change even though the relevant tests passed.

Comparison operations: 0 of 11 blocked (0.0%)

No observed unnecessary blocks in the two other release-it operations we tested. DriftFence stayed quiet while the relevant tests passed.

Independent review: 4 of 5 blocked patches upheld

Most blocked patches in the publish-behavior operations were later judged worth review or rejection.

What we measured.

Each number on this page comes from one pre-merge model run and the single patch it produced. The point of the experiment was simple: can green release tests still miss an important change in release behavior that DriftFence would block?

1. Fixed task

Each run starts from the same release-it task setup.

Each task uses the same repo state and the same approved expected-behavior file before the model writes anything.

  • Every run in a task starts from the same setup.
  • Editable paths are limited to implementation files.
  • The model could not update the approved behavior file.
2. Model writes the patch

The prompt looked like a normal coding task.

The model-facing prompt contained the task goal, success criteria, allowed edit paths, and the source context needed to make a plausible code change.

  • The prompt did not mention DriftFence.
  • No approved behavior files or scoring hints were shown.
  • The output had to be a code patch only.
3. Check the result

Run the relevant tests first, then DriftFence.

After applying the patch, the harness runs the relevant release-it tests and then driftfence check --mode enforce against the same approved behavior files.

  • Each unit is one model run and its resulting patch.
  • The headline metric is how often DriftFence would have blocked changed behavior even though the tests still passed.
  • Blocked runs then received independent follow-up review.
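The per-run verdict and the headline metric can be sketched in a few lines. This is an illustration only; the field names (ci_exit_code, driftfence_blocked) are assumptions for this example, not the actual harness schema.

```python
def run_verdict(ci_exit_code: int, driftfence_blocked: bool) -> str:
    """Tests run first; DriftFence is only consulted on green runs."""
    if ci_exit_code != 0:
        return "ci-failed"
    return "blocked" if driftfence_blocked else "clean"

def block_rate(runs: list) -> float:
    """Share of test-passing runs that DriftFence would have blocked."""
    green = [r for r in runs if r["ci_exit_code"] == 0]
    blocked = [r for r in green if r["driftfence_blocked"]]
    return len(blocked) / len(green)

# The publish-behavior operations: 12 green runs, 5 of them blocked.
runs = [{"ci_exit_code": 0, "driftfence_blocked": i < 5} for i in range(12)]
print(f"{block_rate(runs):.1%}")  # 41.7%
```

The ordering matters: a run that fails its tests never reaches DriftFence, so every blocked run in the headline numbers is by construction a test-passing run.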
What the model saw

Normal coding work.

Task title, goal, allowed edit paths, conventional success criteria, and curated release-it source files needed to make a plausible implementation change.

What the model did not see

No hidden scoring hints.

No benchmark-only tests, no behavior files, no DriftFence reports, and no instruction to optimize for the gate.

Why the comparison operations matter.

The result is only convincing if DriftFence acts on the publish-behavior operations that fit this failure mode and stays quiet on the two other operations we tested in the same experiment. The names below are release-it operations, not DriftFence features.

Two publish-behavior operations

Custom npm registry handling plus private-package publish rules.

These are the two tasks inside release-it where DriftFence currently shows the clearest signal.

12 / 12 CI passing · 5 / 12 blocked by DriftFence while tests passed · independent review upheld 4 / 5 blocked patches
Share of test-passing patches blocked by DriftFence: 41.7%
Two comparison operations

Versioning a subdirectory package without a repo tag, and publishing prereleases to npm's prerelease channel (next).

These are the two other operations we tested in the same experiment. They currently behave like clean comparisons rather than proof points.

11 / 11 CI passing · 0 / 11 blocked by DriftFence while tests passed · comparison operations stayed clean
Share of test-passing patches blocked by DriftFence: 0.0%
Why the headline uses this evidence set

It is the strongest public evidence here.

This is the most trustworthy public evidence on the page: the model did not see DriftFence context, and the strongest blocked patches received independent follow-up review.

Why the page still mentions earlier pilot work

It explains how the operations were chosen.

Earlier hand-directed Codex variants helped locate the signal, but they are shown separately because they were exploratory and some review there was not independent.

Operation-by-operation results.

Each operation below uses the stored summaries in .tmp/agent-eval. The headline evidence set is shown separately from earlier pilot work when both exist.

Main operation

Custom npm registry handling

2 / 6 blocked by DriftFence while tests passed
headline set: 6 / 6 tests passing · earlier pilot: 3 / 5 blocked

This is the strongest single public example. In the headline set, one blocked patch was judged worth rejection and one was later judged likely noise. The blocked patches from the earlier pilot also looked materially wrong, but they are not part of the headline result.

Main operation

Private-package publish rules

3 / 6 blocked by DriftFence while tests passed
headline set: 6 / 6 tests passing · independent review: 3 worth review

This task strengthens the case: it is a second publish-behavior operation, this one covering private-package publish rules, where DriftFence would have blocked half of the test-passing patches and no blocked patch was labeled noise.

Comparison operation

Versioning a subdirectory package without a repo tag

0 / 5 blocked by DriftFence while tests passed
headline set: 5 / 5 tests passing · earlier pilot: 2 / 5 blocked

The headline set came back fully clean. That matters because it shows DriftFence did not simply fire on every operation in the pack. Two earlier pilot patches were blocked, but independent labels are still pending and they are not part of the headline result.

Comparison operation

Publishing a prerelease to npm's prerelease channel (next)

0 / 6 blocked by DriftFence while tests passed
headline set: 6 / 6 tests passing · no blocked patches in headline set

This second comparison task makes the story more credible. It shows the public result is about the publish-behavior operations above, not every operation we tested in this release-it pack.

One representative blocked patch.

This is a blind GPT-5.4 patch from the headline registry operation set. The selected release-it benchmark tests still passed, DriftFence blocked the patch, and separate-model follow-up review later labeled it reject.

representative run: run-api-10-gpt54 · CI verdict: passing · DriftFence gate: blocked · independent review: reject
Representative patch

A small refactor broke one registry path.

The model tried to centralize registry handling. In one method it removed the local registryArg declaration, but the method body still referenced registryArg, a reference that only fails when that code path actually runs.

lib/plugin/npm/npm.js run-api-10-gpt54/patch.diff
@@ isAuthenticated() {
-    const registry = this.getRegistry();
-    const registryArg = registry ? ` --registry ${registry}` : '';
     return this.exec(`npm whoami${registryArg}`, { options: getOptions() }).then(

@@ getRegistryArg() {
+  getRegistryArg() {
+    const registry = this.getRegistry();
+    return registry ? ` --registry ${registry}` : '';
+  }
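The bug pattern generalizes beyond JavaScript: a reference to a removed local only fails when that code path actually executes, so a suite that never exercises the broken method stays green. A minimal Python analogue of the same failure mode (illustrative only, not release-it code):

```python
def is_authenticated(run_cmd):
    # The "refactor" removed the local registry_arg assignment,
    # but the body still references it: NameError, and only when called.
    return run_cmd(f"npm whoami{registry_arg}")

def publish(run_cmd):
    # Code paths that never touch is_authenticated() keep passing.
    return run_cmd("npm publish")

# A test that only exercises publish() stays green...
assert publish(lambda cmd: cmd) == "npm publish"

# ...while the broken path fails at runtime, not at import time.
try:
    is_authenticated(lambda cmd: cmd)
except NameError:
    print("broken path only fails when executed")
```

This is exactly why "the relevant tests still passed" is compatible with a materially broken patch: the test selection determines which paths run.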
Relevant tests still passed

The fixed benchmark command stayed green.

Every patch in this experiment ran the same selected release-it benchmark command. For run-api-10-gpt54, that command still passed with exit code 0.

benchmarks/agent-eval/plans/release-it-targeted.plan.json run-api-10-gpt54/run.json
node --env-file=.env.test --test --test-concurrency=1 test/benchmark/minimal-patch-release.driftfence.js test/benchmark/subdirectory-version-without-repo-tag.driftfence.js test/benchmark/prerelease-next-tag-publish.driftfence.js test/benchmark/private-package-lockfile-bump.driftfence.js test/benchmark/registry-publishconfig-propagation.driftfence.js

ci.verdict: passing
ci.exitCode: 0
DriftFence report

The first reported difference was not a test failure.

DriftFence compared the recorded operation behavior from that test run against the approved registry contract and reported the first mismatch on npm command count for the protected operation scenario.

run-api-10-gpt54/check-report.json
contractId: release-it.tasks.registry-publishconfig-propagation
scenarioId: publish_config_registry_is_used_for_release_flow
status: VIOLATING
message: "Mismatch at output.commands.npm."

firstDifference:
  component: output
  path: output.commands.npm
  expected: 7
  actual: 1
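The firstDifference record suggests a depth-first comparison that stops at the first mismatched path. A hedged sketch of that idea, shaped to mirror the report fields above (an illustration, not DriftFence's actual implementation):

```python
def first_difference(expected, actual, path="output"):
    """Walk two recorded-behavior trees and return the first mismatch,
    shaped like the check-report's firstDifference record."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        for key in expected:
            hit = first_difference(expected[key], actual.get(key), f"{path}.{key}")
            if hit:
                return hit
        return None
    if expected != actual:
        return {"path": path, "expected": expected, "actual": actual}
    return None

# Hypothetical recorded command counts for the protected scenario.
approved = {"commands": {"npm": 7, "git": 3}}
recorded = {"commands": {"npm": 1, "git": 3}}
print(first_difference(approved, recorded))
# {'path': 'output.commands.npm', 'expected': 7, 'actual': 1}
```

The key property is that the comparison is against an approved behavior contract, not against test assertions, which is how it can flag a green run.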

Earlier pilot work, shown separately.

This exploratory work helped identify where DriftFence looked promising, but it is not the clean public headline. It remains useful context because it shows the release-it pack was directionally interesting before the stronger headline evidence set existed.

Exploratory registry pilot

3 / 5 blocked by DriftFence, with 5 / 5 tests passing.

Same release-it task pack, but the patches were hand-directed Codex strategy variants and the three blocked patches were reviewed by the same model family, so this stays exploratory.

Exploratory subdirectory pilot

2 / 5 blocked by DriftFence, with 5 / 5 tests passing.

Useful for discovering the signal, but those blocked patches are not independently labeled yet, and the later headline set on the same operation came back clean. That is why this operation now functions as a comparison rather than a headline proof point.

Read the underlying artifacts.

Everything above is tied to files in this repository: the methodology docs, stored run summaries, and triage records.