all the criticism has been said, all the takes been had.

david_chisnall@infosec.exchange

I suspect testing has the same properties as translation. It’s moderately easy to build machine-translation systems that are kind-of okay. A mechanical dictionary is a reasonable approximation. If something goes through your post, looks up each word in an English-French dictionary (for example) and outputs the resulting text, it won’t be correct, but it will be vaguely comprehensible. If you build a dictionary of bigrams or trigrams (sequences of 2-3 words) this gets a bit better because now collocations are more likely to be translated correctly. It won’t be as good as a professional translator, but it will more or less look like the target language. Add more statistical modelling and you will get better up to a point. But there’s a cliff where you can’t improve without actually understanding the content. No amount of statistical modelling will let you accurately translate the things that are statistical outliers and the extrinsic knowledge necessary means that you can’t infer a correct translation from the text alone without understanding its context.

Tests have a similar property. Good tests convey the intention, but the intention is not part of the code and so can’t be inferred from it. Good tests cover the things that the test author knows are corner cases, but these can’t be inferred from the code either (a few can, if the language has explicit error-handling constructs) because they’re a property of the input data.

In both cases, LLMs try to compensate for the lack of understanding by having a lot of examples of similar things in their input. If the thing you’re translating is similar to a load of other things, you may not need to understand it to translate it correctly because the first dozen (or hundred, thousand, or whatever scale you need) people to translate something like that did the hard work and you can reuse it. If the thing you’re testing is similar to a load of other things that already exist, someone else may have done the hard work of identifying the common failure modes and expressing intent.

But commonly LLM-generated tests end up testing that the code does what the code does. And that’s not useful. If you want that, just use fuzzing in a harness that tests trace equivalence between two versions of the program (for the same sequence of inputs, do they generate the same output?). That is useful for no-functionality-change-intended patches (typically things that improve performance or simplify unnecessary complexity), but most changes to the codebase are there because you want the behaviour to change. Good tests will fail if you changed something that was part of an API contract but will not fail if you added new behaviour, but tests based on the code will change.

This isn’t limited to LLMs. Some of the LLVM tests are just ‘run this command, the output should look like this’. People typically reject these in review now because long and painful experience showed us that it was hard to refactor when a change broke a test and the change author couldn’t tell if the difference in output came from something we actually cared about or just something that happened to be part of the old version’s output. But humans can, at least, tell the difference in the tests because they understand what it is that they intend with the change that introduces the test.

jonny@neuromatch.social

@davey_cakes
@ainmosni
Exactly the same, I recognize that feeling too

jonny@neuromatch.social

@david_chisnall
@elebertus
Well said

gklyne@indieweb.social

@jonny “to an LLM something saying it does something is the same thing as it actually doing that thing” bears pondering

mk30@tilde.zone

@jonny I agree that it's gambling addiction.

jdp23@neuromatch.social

great analysis! @david_chisnall @jonny @elebertus

bms48@mastodon.social

@poleguy @jonny rsync is the fallback method for zxfer, a FreeBSD script-based tool for replicating ZFS datasets over the network; normally it calls out to zfs send/recv to do the hard work of shuffling deltas. I don't use it myself; I'm currently using a 3rd party backup agent with SFTP backing for Win11 clients but not rsync itself.

bms48@mastodon.social

@jonny I suspect the "information gap" you mention here corresponds to known LLM kryptonite: they can't emulate human reasoning in the abductive mode (to best outcome or expectation, usually inherently non-boolean, and the domain of GOFAI expert systems). Human decisions about refactoring legacy code bases (Chesterton's Fence) might constitute an example of problem solving requiring abductive reasoning. I published https://burdentennis.com yesterday but didn't put my name to it directly.

bms48@mastodon.social

@jonny And I found a solid refutation of Cartesian dualism just now from the Wikipedia article for category error: "The Concept of Mind" by Gilbert Ryle. That's aiming at Dawkins' contention Claude is conscious (very unlikely). @mattsheffield was driving at this. Humans can be really stupid as Carlo Cipolla points out. Even when LLM boosters ignoring Searle get shut down by Hitchens' razor the burden tennis recurs. But empirical refutation is starting to emerge as the con is about to be closed.

kris@todon.eu

@jonny @elebertus

I have seen specifically test versions of different objects/structs that are slightly modified copies of other structs, and only (or mostly!) the testing version is used in the tests.

ra@mstdn.social

Sorry to walk into this thread without bringing anything, but what does 'PR' mean; Production....? I keep seeing it in other convos about code still and can't work it out. TIA.

d_rift@beige.party

@fluffy @jonny .. they WHAT?!

lightbeaminsight@mastodon.social

@Ra “pull request”. The new code sits outside of the main code, and in order to add new code you raise a request to pull it in. This then gives you the option to review the changes, deletions and additions are highlighted for the reviewer of the pull request.

Usually code isn’t brought into the main codebase until someone has reviewed the changes, and verified things work as intended.

(LLMs typically generate so many changes, and so much code, that this becomes very difficult)

d_rift@beige.party

@fluffy @jonny omfgs. They did.

jonny@neuromatch.social

@kris
@elebertus
"Code only used in the tests" is another LLM favorite

0x2ba22e11@unstable.systems

@fluffy @jonny I think in the abstract Python tests for a C project could be fine because tests usually contain a lot of setup and assertion code that is run only once, so running it with a slow interpreter is cheaper than compiling it then running it. You can bind to C libraries with cffi. You get to use the hypothesis library, which is really nice. Automatic memory management makes the tests shorter.

The one big obvious downside is that Python alway used to throw valgrind diagnostics from its GC doing a "clever" trick with uninitialised memory (I'm not sure if they fixed this since), which necessitates adding a suppression to valgrind options.

Edit to add: but in this specific context I would be surprised if any of that was the reason, lol.

bms48@mastodon.social

@0x2ba22e11 @fluffy @jonny "hypothesis" made my hitlist for Python software testing tools on 2026-05-01 for future work, along with pytest, unittest, and nose. ruff for linting and bandit for static analysis. Much of the bashism could still be replaced with pexpect. CPython itself supports DTrace/Systemtap uSDT instrumentation now of the interpreter itself. My preference for ffi is Cython for cffi for "the reasons"... but capturing e.g. cBPF in libpcap correctly took some figuring out.

philsalkie@mindly.social

@jonny

Item 1 in my lecture on Securing Industrial Controls Communications:

"Just because it works doesn't mean it's right."

"Works" is a multi-axis problem space, and any complex system will never have all the axes of "works" at 100%

For example:

Does it do what I want with good input?
Does it do something safe with bad input?
Does it run on my system?
Does it run on someone else's system?
Will it keep working if things change?

And so on and so on...

paul@notnull.space

@d_rift @fluffy @jonny no fucking way

d_rift@beige.party

@paul @fluffy @jonny On every platform. Individually. And like, if you're reading the PR like a reviewer, it happens fairly early... well before "you modified how many files" fatigue sets in. So I can only assume either someone thought this was OK or they never read it. Given human nature, I can definitely assume the human operator ("author" would be a stretch) _ran_ it before having read it.