all the criticism has been said, all the takes been had.

bms48@mastodon.social

@jonny I have seen this pattern of which you speak when attempting to use LLMs to compare TCP Delayed-ACK implementations between BSD-derived code bases. They generated output suggesting semantics that just weren't there, presumably based on how similarly named things were between each fork, but this was not obvious without reading the source for oneself in context. This went doubly for FreeBSD where there are multiple TCP functional blocks ("stacks").

jonny@neuromatch.social

To a person, the whole purpose of the test is for it to fail when it should. That's an elemental part of writing good tests: they must fail before the patch, or else they provide no protection. We want protection from failure, that is good for us. We need tests to protect us because we can't possibly evaluate all the other parts of a complex system when we try to fix one part of it.
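A minimal sketch of "fail before the patch", assuming a hypothetical port parser (none of these names come from rsync or any real project). The same check is run against the pre-patch and post-patch versions, and it is only worth keeping because it tells them apart:

```python
def parse_port_buggy(text: str) -> int:
    # Hypothetical pre-patch behavior: silently accepts out-of-range ports.
    return int(text)


def parse_port_fixed(text: str) -> int:
    # Hypothetical post-patch behavior: rejects ports outside 1..65535.
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def regression_test(parse_port) -> bool:
    """Return True if the implementation rejects an out-of-range port."""
    try:
        parse_port("70000")
    except ValueError:
        return True
    return False


# The test protects something only if it distinguishes the two versions.
assert regression_test(parse_port_buggy) is False  # fails before the patch
assert regression_test(parse_port_fixed) is True   # passes after the patch
```

A test that returns True for both versions would add to the test count without adding any protection.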

LLM slot machines change what tests mean - of course we still want the code to work good, but if we're not evaluating the code or the tests, then what the slot machine turns them into is just a high score and the jackpot condition. 130 new tests added, that means it's good. They pass, that means I win.

The bugfix loop with LLMs defeats the purpose of automated tests and renders it no better than manual testing: you notice a bug, you yell at the LLM to fix it, you keep looking at the specific thing that's broken until it's fixed, good robot, ship it. The changes don't have meaningful tests, and nothing else does either, so the slot machine loop repeats, bug->fix->win. Very velocity. Rocket fuel even.

murodegrizeco@toad.social

@jonny

Huh, yeah, gambling works as a metaphor!

I was going more with ...

"This person came back from the alien ship, they had a grabby thing on their face but it fell off, and now the person is weirdly hungry, and maybe we should worry about what's growing inside them."

jonny@neuromatch.social

But it's not just as simple as "OK if I read the tests I should be fine" because LLM code is often untestable. It writes code with function and class names that make it seem like something does something, but they might just be flat wrong. Or there is some invisible fallback condition the LLM encountered while generating code and added to just make tests pass, but that has entirely different behavior.
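A hedged sketch of what such an invisible fallback could look like. The function name and behavior here are invented for illustration and are not taken from any real code base:

```python
import hashlib


def verify_checksum(data: bytes, expected_hex: str) -> bool:
    """Hypothetical helper whose name promises verification."""
    try:
        expected = bytes.fromhex(expected_hex)
    except ValueError:
        # Invisible fallback: a malformed checksum is treated as a match,
        # so the function reports "verified" without verifying anything.
        return True
    return hashlib.sha256(data).digest() == expected


def test_verify_checksum_accepts_valid_data() -> None:
    good = hashlib.sha256(b"payload").hexdigest()
    assert verify_checksum(b"payload", good)


# Passes, reads as rigorous, and never touches the fallback branch.
test_verify_checksum_accepts_valid_data()
```

Reading only the test, or only the function name, gives no hint that a non-hex checksum string is accepted as valid.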

If you've watched an LLM generate a project over time, you see it generating its own private language, and I've even seen it reinvent language features like function definitions themselves. Its names form part of an increasingly inaccessible web of meaning that no human can penetrate.

Writing tests requires a kind of "information gap" where you can have enough intuition about what something does, but not how it does it, so you can a) know what it should do, b) make a strong assertion about that expectation, c) without mirroring the internal implementation's limits. That's hard! And really only possible when the foundation, (a), is true. Code must have an articulable purpose in order to be testable, that's tautological, that defines what failure is. But since LLM code increasingly detaches from any kind of stable description or expectation, even if the tests look very rigorous, you can't know if they are just tailored to the specific internal details of its function to eke out a pass, because it's hard to know what it should do anyway.
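One way to picture the difference between (b) and (c), as a sketch under an assumed purpose. The `dedupe` function and its stated purpose ("drop duplicates, keep first-seen order") are hypothetical:

```python
def dedupe(items: list[str]) -> list[str]:
    """Assumed purpose: drop duplicates, keep first-seen order."""
    # Bug relative to that purpose: sorting discards first-seen order.
    return sorted(set(items))


def mirrored_test() -> bool:
    # Restates the implementation, so it passes whatever the purpose was.
    data = ["b", "a", "b", "c"]
    return dedupe(data) == sorted(set(data))


def expectation_test() -> bool:
    # Asserts the articulated purpose without knowing the internals.
    data = ["b", "a", "b", "c"]
    return dedupe(data) == ["b", "a", "c"]


assert mirrored_test() is True      # looks rigorous, catches nothing
assert expectation_test() is False  # fails, which is the useful result
```

The second test can only be written by someone who knows what `dedupe` is supposed to do. Without that, the first test is indistinguishable from a good one.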

So really you have to read the test code, the code under test, and also all the other code that might call the code under test. Aka you have to read everything. And rather than reading something that was written to be read, you're wading through a slop swamp. So you can't. It takes more time than just writing it. The erosion of testing is just an intrinsic part of the loop that you can't escape without breaking the spell of the slot machine, and it is what drives the loop.

jonny@neuromatch.social

@elebertus
I think Bun's Rust rewrite is the single largest high profile yeet I have ever seen, if you haven't seen that yet

jonny@neuromatch.social

So rsync rewriting all the tests puts the entire project in play. Now the entire protective surface has been sloshed through a layer of probability, so the loop must accelerate. Followup PRs add more carveouts with lengthy LLM justifications that sound perfectly plausible but amount to an erosion of the protective surface. We go from cumulative improvement to a random walk.

jonny@neuromatch.social

@elebertus
I've read so much LLM code at this point, there are still patterns that are present but elude my understanding, but one thing that's clear is that there are foundational flaw categories that are not improved upon by model version and appear in wildly different projects using wildly different models and harnesses. Testing is a big nexus of those flaws. I am not close to what would be a satisfying explanation of the dynamics, but every project suffers fucked testing problems.

eliocamp@mastodon.social

@jonny One thing I love about this and other posts linking to slop on GitHub is that more often than not I flat out can't follow the link because GitHub is not working.

bstacey@icosahedron.website

@jonny I struggle to express how bleak this is.

bstacey@icosahedron.website

@jonny It's like everyone decided to take a bath in mercury and leaded gasoline.

bipolaron@scholar.social

@bstacey @jonny with a plugged in datacenter

poleguy@mastodon.social

@jonny I just lost my beer league hockey championship as the last shooter on a 14 round shoot out. I'm sitting in my driveway reading your thread. I'll need to read it again in the morning.

I don't remember why I followed you originally. But I love this thread.

This whole rsync thing is the most interesting thing that has come out of the ai bubble.

I had a negative feel for rsync after years ago reading a blog criticizing its sloppy design.

Yet I rely on it daily. I have so many questions.

jonny@neuromatch.social

@poleguy
RIP on the shootout, hopefully the other team bought the beer and you got to pinch the other goalie's cheek a bit. You'll get 'em next season

poleguy@mastodon.social

@jonny indeed, that's the right feeling!

We have sponsorship from a brewery, so the locker room beer (and custom jerseys) are "free."

But we sat at the bar with the other team. It is just a game after all.

Both sides had a good time. And we had fans cheering for both sides. And kids crashing the locker room to celebrate despite the loss... We shared our NA options. Can't ask for more.

I'd love to engage more on this thread technically... I have thoughts. Maybe Monday.

jens@social.finkhaeuser.de

@jonny related:

https://finkhaeuser.de/2026-04-10-outsourcing-thought-is-going-great/

jens@social.finkhaeuser.de

@jonny @ainmosni Gambling (addiction) works on the so-called Variable Reinforcement Schedule.

The TL;DR of it is, results are random enough that even though it seems there may be a pattern, there isn't. You're pulled in because "one more time will show my pattern detection was right".

And since human brains are excellent pattern detection machines, every time this succeeds yields huge dopamine rewards.

I'm pissed off with the pattern, which is why I stop. But I can't deny its power.

themipper@mastodon.social

@jonny this whole thing is so bad that the only viable way seems to be to fork it before the LLM sloppening. It is a shame to see more and more foundational projects fall into the LLM trap.

And as always you hit the nail on the head with your deep dive and explanations. I love reading them.

I will use your observation on how for an LLM what is written is the same as what is happening.

gunchleoc@mastodon.scot

@jonny Of course it's a smoke test - as in "smoke and mirrors"

WTAF.

fluffy@plush.city

@jonny also why the hell would they write tests for a C program/library in Python? It makes no sense.

spitfire@mastodon.de

@jonny holy crap this story gets worse by the day. Thank you very much for summing up this aspect of the situation for a non-sw-engineering-person like me. 🫡