all the criticism has been said, all the takes been had.
-
@elebertus
I've read so much LLM code at this point, and there are still patterns that are present but elude my understanding. One thing that's clear is that there are foundational flaw categories that are not improved upon by model version and appear in wildly different projects using wildly different models and harnesses. Testing is a big nexus of those flaws. I am not close to what would be a satisfying explanation of the dynamics, but every project suffers fucked testing problems. I have seen specifically test versions of different objects/structs that are slightly modified copies of other structs, and only (or mostly!) the testing version is used in the tests.
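A minimal sketch of the "test-only copy" pattern described above, for readers who haven't run into it. Every name here (`Packet`, `FakePacket`, `checksum`, the two check functions) is hypothetical and the bug is planted on purpose; it is not taken from any project in this thread.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Production type: the code that actually ships (hypothetical)."""
    payload: bytes

    def checksum(self) -> int:
        # Deliberate bug for illustration: the last byte is dropped.
        return sum(self.payload[:-1]) % 256


@dataclass
class FakePacket:
    """Test-only near-copy of Packet with slightly different behaviour."""
    payload: bytes

    def checksum(self) -> int:
        return sum(self.payload) % 256


def check_copy() -> bool:
    # Exercises only the copy, so it passes whatever Packet does.
    return FakePacket(b"\x01\x02\x03").checksum() == 6


def check_real() -> bool:
    # Exercises the shipped type, so it can actually notice the bug.
    return Packet(b"\x01\x02\x03").checksum() == 6
```

The check against the copy stays green while the shipped type is broken, which is why a suite built on copies protects nothing.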
-
So rsync rewriting all the tests puts the entire project in play. Now the entire protective surface has been sloshed through a layer of probability, so the loop must accelerate. Followup PRs add more carveouts with lengthy LLM justifications that sound perfectly plausible but amount to an erosion of the protective surface. We go from cumulative improvement to a random walk.
Sorry to walk into this thread without bringing anything, but what does 'PR' mean; Production....? I keep seeing it in other convos about code still and can't work it out. TIA.
-
@jonny ... and why the everloving FUCK do these tests run as root
-
@Ra “pull request”. The new code sits outside of the main code, and in order to add new code you raise a request to pull it in. This then gives you the option to review the changes; deletions and additions are highlighted for the reviewer of the pull request.
Usually code isn’t brought into the main codebase until someone has reviewed the changes, and verified things work as intended.
(LLMs typically generate so many changes, and so much code, that this becomes very difficult)
-
-
I have seen specifically test versions of different objects/structs that are slightly modified copies of other structs, and only (or mostly!) the testing version is used in the tests.
@kris
@elebertus
"Code only used in the tests" is another LLM favorite -
@jonny also why the hell would they write tests for a C program/library in Python? It makes no sense.
@fluffy @jonny I think in the abstract Python tests for a C project could be fine because tests usually contain a lot of setup and assertion code that is run only once, so running it with a slow interpreter is cheaper than compiling it then running it. You can bind to C libraries with cffi. You get to use the hypothesis library, which is really nice. Automatic memory management makes the tests shorter.
The one big obvious downside is that Python always used to throw valgrind diagnostics from its GC doing a "clever" trick with uninitialised memory (I'm not sure if they fixed this since), which necessitates adding a suppression to valgrind options.
Edit to add: but in this specific context I would be surprised if any of that was the reason, lol.
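A rough sketch of what "Python tests for a C library" can look like in the abstract. To keep it dependency-free it swaps in the standard-library `ctypes` for cffi and a seeded random loop for hypothesis, so it is a stand-in for the approach, not an example of either library. It assumes a Unix-like system where the C library can be loaded this way, and `check_strlen_matches_len` is a made-up name.

```python
import ctypes
import ctypes.util
import random

# Load the C standard library; if it can't be located by name, fall back to
# the symbols already loaded into this process (assumes a Unix-like system).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def check_strlen_matches_len(trials: int = 200, seed: int = 0) -> bool:
    """Compare C strlen against Python len on random NUL-free byte strings."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Bytes are drawn from 1..255 because strlen stops at the first NUL.
        data = bytes(rng.randint(1, 255) for _ in range(rng.randint(0, 64)))
        if libc.strlen(data) != len(data):
            return False
    return True
```

The setup and assertions live in Python while the function under test is still the compiled C one. With hypothesis, the hand-rolled loop would be replaced by generated inputs and automatic shrinking of failing cases.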

-
@0x2ba22e11 @fluffy @jonny "hypothesis" made my hitlist for Python software testing tools on 2026-05-01 for future work, along with pytest, unittest, and nose. ruff for linting and bandit for static analysis. Much of the bashism could still be replaced with pexpect. CPython now supports DTrace/SystemTap USDT instrumentation of the interpreter itself. My preference for ffi is Cython for cffi for "the reasons"... but capturing e.g. cBPF in libpcap correctly took some figuring out.
-
Here's an example from some code that was thrust at me this week. The rest of the tests try a bit harder to look like tests, but this one is perplexing.
What does it test? The function name suggests it's a smoke test. LLMs love to call things smoke tests. That would suggest this would be an early-run test that fails loudly if some basic precondition - like having ffmpeg - fails. Or, I guess we are smoke testing the ensure_ffmpeg function? Anyway who knows. However we first check if ffmpeg or ffprobe are present, which is exactly what ensure_ffmpeg does. If they aren't present, a warning tells us that ffmpeg/ffprobe are required for the video tests, which makes it seem like this should be a parameterizing test that controls which tests are run, which of course it does not do.
So the test literally does nothing and cannot possibly fail, but says it does at least two things, because to an LLM something saying it does something is the same thing as it actually doing that thing.
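The original code isn't reproduced in this thread, so the following is only a hypothetical reconstruction of the shape described, next to one way a precondition check could actually control which tests run. All names (`missing_tools`, `vacuous_smoke_test`, `requires_ffmpeg`, `VideoTests`) are invented for illustration; the only real APIs used are `shutil.which`, `warnings.warn` and `unittest.skipIf`.

```python
import shutil
import unittest
import warnings

REQUIRED_TOOLS = ("ffmpeg", "ffprobe")


def missing_tools(which=shutil.which):
    """Return the required tools that `which` cannot find on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


def vacuous_smoke_test(which=shutil.which):
    # The shape described above: warn if tools are missing, assert nothing.
    if missing_tools(which):
        warnings.warn("ffmpeg/ffprobe are required for the video tests")
    # No assertion and no skip, so this returns normally either way.


def requires_ffmpeg(which=shutil.which):
    """Decorator that visibly skips a test when the tools are absent."""
    missing = missing_tools(which)
    return unittest.skipIf(bool(missing), "missing: " + ", ".join(missing))


class VideoTests(unittest.TestCase):
    @requires_ffmpeg()
    def test_probe(self):
        # Runs only when both tools were found; otherwise reported as skipped.
        self.assertIsNotNone(shutil.which("ffprobe"))
```

The difference is that the second version shows up in the test report as a skip with a reason, where the first one passes silently whether or not ffmpeg exists.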
Item 1 in my lecture on Securing Industrial Controls Communications:
"Just because it works doesn't mean it's right."
"Works" is a multi-axis problem space, and any complex system will never have all the axes of "works" at 100%
For example:
Does it do what I want with good input?
Does it do something safe with bad input?
Does it run on my system?
Does it run on someone else's system?
Will it keep working if things change?
And so on and so on...
-
-
@paul @fluffy @jonny On every platform. Individually. And like, if you're reading the PR like a reviewer, it happens fairly early... well before "you modified how many files" fatigue sets in. So I can only assume either someone thought this was OK or they never read it. Given human nature, I can definitely assume the human operator ("author" would be a stretch) _ran_ it before having read it.
-
