The test that proved nothing: when a green check is the most dangerous kind

Process · 1 June 2026 · 8 min read · Tech & tools · Updated 22 July 2026

The Long Watch lives or dies on one promise: the same world, played the same way, unfolds the same way every time. We lean hard on automated checks to hold ourselves to it. So the failure that scares us isn’t a check that turns red. A red check is doing its job.

It’s a check that turns green while actually verifying nothing. That kind doesn’t just fail to help; it manufactures confidence you haven’t earned. What follows is one such check, the bug it sailed straight past, and the disciplines we wrote down afterward so it couldn’t fool us the same way twice.

The setting was the return-to-the-earth half of the world’s decay cycle: a creature dies, its body lingers, breaks down over a few in-game days, and feeds nutrients back into the soil beneath it. How that homecoming feels in the world is the subject of “return to the earth.” This post is about the test we wrote to guard the math underneath it, and how that test stopped guarding anything.

The check that zeroed out the thing it watched

The pace of decay isn’t constant. It answers to the warmth of the soil, which itself shifts with the seasons: winter slows decomposition, summer hurries it along. To make sure that math never drifted silently out from under us, we wrote a check that re-runs the decay over a fixed world and asserts the same result every time. The intent was good. The execution had a quiet flaw.

A stretch of voxel ground at golden hour, frost-touched and pale on one side, warm and green with growth on the other. — The same ground, cold and warm: the warmth the test had quietly switched off.

To keep the check simple and reproducible, we set its temperature influence to zero. One line, well-meant: with the seasonal warmth held flat, the numbers are tidy and the run is easy to reason about. But temperature was the only dimension the check existed to protect. Switching it off didn’t simplify the test. It gutted it. The check was now structurally blind to the one thing it was supposed to see.

A check with the input it guards switched off can never fail for the reason it exists. Weakness would be easier to spot; this is confidence with nothing behind it.

And there was a real bug to miss

Separately, and this is what turns a tidy mistake into a near-disaster, the decay code had an actual defect. It was reading the wrong point in the seasonal cycle. The neighbouring plant-growth system applies a small phase offset when it asks the calendar what season it is, and the new decay code was supposed to mirror that. It didn’t. It omitted the offset and read the season a beat out of step.

Here is the trap closing. That bug could only ever express itself through the temperature dimension: reading the season wrong only matters because the season sets the warmth. And the check had zeroed temperature out. So the decay code could have been completely, obviously broken, and the check would still have gone green every single time. The bug and the blind spot lined up perfectly: the test was incapable of failing for exactly the reason the code was wrong.
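A toy model makes the overlap concrete. This is a sketch, not the game’s code: the function names, the cosine season curve, the 360-day year, and the offset of 15 days are all assumptions chosen for illustration. It shows only the shape of the problem: with the temperature influence at zero, the correct and the buggy season lookup produce identical totals, so no assertion on the result can tell them apart.

```python
import math

YEAR_DAYS = 360      # hypothetical length of the in-game year
PHASE_OFFSET = 15    # hypothetical offset the growth system applies


def soil_warmth(day: int, phase_offset: int) -> float:
    """Seasonal warmth in [0, 1] on a toy cosine curve."""
    angle = 2 * math.pi * ((day + phase_offset) % YEAR_DAYS) / YEAR_DAYS
    return 0.5 - 0.5 * math.cos(angle)


def decay_rate(day: int, temp_influence: float, phase_offset: int) -> float:
    """Base rate scaled by warmth; temp_influence=0 removes the season entirely."""
    base = 1.0
    return base * (1.0 + temp_influence * (soil_warmth(day, phase_offset) - 0.5))


def total_decay(temp_influence: float, phase_offset: int, days: range) -> float:
    """Sum the decay over a fixed window of days."""
    return sum(decay_rate(d, temp_influence, phase_offset) for d in days)


window = range(40, 44)  # a few in-game days

# The check as written: temperature muted. Correct and buggy code agree exactly.
muted_ok = total_decay(0.0, PHASE_OFFSET, window)
muted_bug = total_decay(0.0, 0, window)  # offset omitted, as in the defect

# The same check with temperature live: the missing offset changes the result.
live_ok = total_decay(1.0, PHASE_OFFSET, window)
live_bug = total_decay(1.0, 0, window)

print(muted_ok == muted_bug)  # the muted check cannot distinguish them
print(live_ok == live_bug)    # the live check can
```

Under these assumptions the muted comparison is equal and the live one is not, which is the whole story in two lines: the phase offset only ever reaches the result through the term the check multiplied by zero.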

Careful reading caught it, not the checks

No automated net found this. We did, by reading the code closely. The decay code carried a comment claiming it mirrored an existing, trusted routine. Rather than take the comment at its word, we pulled up both pieces and compared them line by line, and found they diverged exactly where the comment promised they matched. The label said “same as the trusted version.” The behaviour wasn’t. We fixed the decay code to match its cited precedent exactly, the phase offset and all, and the loop closed: a body appears, scavengers come, decomposition completes, the soil is visibly enriched, at the right pace, in the right season.

A small still creature lying on dark, rich voxel forest soil in golden light, with tiny green sprouts pushing up around it. — The loop closed: a body returns to the earth, and the soil it feeds grows richer.

The same lesson, three times

What stung was that this wasn’t the first time a green check had proved nothing during this stretch of work. It was the third. Each wore a different disguise, and together they taught us the shape of the whole family.

The first was a check that confirmed a piece of new code was wired in (registered, present, accounted for) but never confirmed it could actually load and run. A typo slipped through underneath it, because the quick check skipped the very step that would have rejected the typo.
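The gap between “registered” and “loadable” can be sketched in a few lines. Everything here is hypothetical: the registry layout, the `tick` entry point, and the typo are stand-ins, since the post does not describe how the real code is wired in. The point is only that a presence check and a load check answer different questions.

```python
# Hypothetical registry: behaviour name -> source text of its module.
REGISTRY = {
    "corpse_decay": "def tick(state):\n    retrun state\n",  # typo: 'retrun'
}


def is_registered(name: str) -> bool:
    """The quick check: the name is present. Says nothing about loading."""
    return name in REGISTRY


def can_load(name: str) -> bool:
    """The stricter check: the source must compile, execute, and expose tick()."""
    namespace: dict = {}
    try:
        exec(compile(REGISTRY[name], name, "exec"), namespace)
    except Exception:
        # Any failure to compile or execute means the code cannot be loaded.
        return False
    return callable(namespace.get("tick"))


print(is_registered("corpse_decay"))  # the quick check is satisfied
print(can_load("corpse_decay"))       # the load check rejects the typo
```

The quick check passes and the load check fails on the same entry, because only the second one performs the step that would reject the typo.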

The second was sharper. While building the creatures’ ability to die, we wrote a determinism check for that headline feature, and then filed it in a way that meant the test runner never actually ran it during a normal pass. The runner walks a master list to decide what to run; the tag we’d filed the check under only decides what to skip. The check belonged to no list the runner executed. It ran in no mode at all, while every run still came back green. The headline check for the headline feature proved nothing, and we only noticed because adding it back to the executed list meant the run now had more to verify than before. The proof it had never run was the work that appeared the moment we made it run.

The third was the corpse-decay check above: it ran, but with a dimension switched off, blind by construction.

Registered but unloadable. Filed but unrun. Run but blind. Three faces of one failure: a green checkmark is only as trustworthy as your confidence that the check behind it actually exercises the thing it watches.

All three at least turned green — a false signal, but a signal. There is a meaner sibling that makes no sound at all: a failure that produces no signal whatsoever, not even a misleading green to argue with.

The family kept growing after this. A fourth face turned up in July, in a canopy check whose scenario had been cloned from production, sterile pairing and all, so it went green by faithfully reproducing the bug it existed to catch. The longest-running instance we have found yet held its green zero for two months.

What we wrote down

The instinct after a near-miss like this is to resolve to be more careful. We’ve learned to distrust that instinct — “be more careful” is a wish, not a mechanism, and it fails the moment attention lapses, which it always eventually does. The fix has to live in the gate, not in the good intentions of whoever’s at the keyboard. So each of these turned into a standing rule.

When code claims to mirror a trusted piece, verify the claim against what that piece actually does. A comment that says “same as the existing routine” is a hypothesis, not a fact. The corpse-decay bug was a divergence hiding behind a label that promised a match; now the label gets checked against the behaviour, not believed.
A green result is suspect, not trusted, when the check has a dimension turned off. Zeroing an input to simplify a test is sometimes the right call — but it’s a warning sign on the result, never a clean pass, because the test is now provably unable to fail along the axis it muted.
Close the runner’s blind spot at the source. The unrun determinism check became a startup self-check: the runner now refuses to begin unless every tagged test also appears in the list it actually executes, and it fails loudly, before running, rather than sailing to a vacuous green. A test that exists but never runs is now an error you can’t miss, not a silence you have to catch.
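The third rule, the startup self-check, might look something like the sketch below. The data shapes, the exception name, and the test names are assumptions for illustration; the post only says that the runner walks a master list, that tags decide what to skip, and that the runner now refuses to start when a tagged test is missing from the executed list.

```python
class RunnerConfigError(Exception):
    """Raised before any test runs when a tagged test is on no executed list."""


def find_orphans(tagged: dict[str, set[str]], master_list: list[str]) -> list[str]:
    """Return tagged tests that the runner would never execute."""
    executed = set(master_list)
    return sorted(name for name in tagged if name not in executed)


def startup_self_check(tagged: dict[str, set[str]], master_list: list[str]) -> None:
    """Fail loudly, before running anything, if any tagged test is unreachable."""
    orphans = find_orphans(tagged, master_list)
    if orphans:
        raise RunnerConfigError("tagged but never executed: " + ", ".join(orphans))


# Hypothetical data: test name -> tags it was filed under.
TAGGED = {
    "growth_determinism": {"determinism"},
    "death_determinism": {"determinism", "slow"},
}
MASTER_LIST = ["growth_determinism"]  # death_determinism was never added

try:
    startup_self_check(TAGGED, MASTER_LIST)
except RunnerConfigError as err:
    print(err)
```

The design choice worth noting is where the failure lands: at startup, as an error, rather than as a missing line in a report that nobody was going to read for absences.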

That last one matters most, because it’s the difference between a rule and a guard. A rule asks a person to remember. A guard remembers for them.

A test that can’t fail when the thing it watches breaks is worse than no test. No test is honest about the gap. A green one lies about it.

Why this is load-bearing for the whole game

None of this would matter if the simulation were a loose collection of effects. Instead it’s a web of interlocking loops. A creature dies; its body becomes a corpse; the corpse enriches the soil; richer soil grows thicker plants; thicker plants feed more creatures; and around again. Each loop is guarded by checks like the ones above. If those guards are hollow, green but blind, the determinism that makes a saved world trustworthy can rot from the inside while every light on the dashboard stays comfortably lit.

So we treat our tests less like a safety net and more like an immune system: its job isn’t to feel reassuring, it’s to actually catch the thing that’s wrong. A net whose holes you can’t see through is just a blanket. The corpse-decay check taught us to keep asking the only question that matters about a green run: not “did it pass?” but “could it have failed?” If the answer is no, the green means nothing, and the work isn’t done. That question is machinery now rather than a habit: we break our own code on purpose and write down which checks noticed.
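Breaking code on purpose and recording which checks notice can be sketched in miniature. This is an illustrative stand-in, not our actual tooling: the linear rate formula, the offset of 15, and the names `reference`, `mutant_no_offset`, and `checks_that_noticed` are all hypothetical. A check that notices no deliberate break is one that could not have failed.

```python
from typing import Callable, Dict, List

Impl = Callable[[int, float], float]  # (day, temp_influence) -> decay rate


def reference(day: int, temp_influence: float) -> float:
    """Toy correct implementation with a hypothetical phase offset of 15."""
    return 1.0 + temp_influence * 0.01 * (day + 15)


def mutant_no_offset(day: int, temp_influence: float) -> float:
    """Deliberate break: the phase offset is dropped."""
    return 1.0 + temp_influence * 0.01 * day


def check_muted(impl: Impl) -> bool:
    """Passes if impl matches the reference with temperature switched off."""
    return all(impl(d, 0.0) == reference(d, 0.0) for d in range(10))


def check_live(impl: Impl) -> bool:
    """Passes if impl matches the reference with temperature live."""
    return all(impl(d, 1.0) == reference(d, 1.0) for d in range(10))


def checks_that_noticed(mutant: Impl, checks: Dict[str, Callable[[Impl], bool]]) -> List[str]:
    """Names of the checks that fail when run against the broken implementation."""
    return sorted(name for name, check in checks.items() if not check(mutant))


CHECKS = {"muted": check_muted, "live": check_live}
noticed = checks_that_noticed(mutant_no_offset, CHECKS)
print(noticed)
```

In this toy setup only the live check appears in the list; the muted check stays green against the mutant, which is exactly the evidence that its green was never worth anything.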