all the criticism has been said, all the takes been had.

sesamzoo@mastodon.social

@jens @jonny At work "everyone has to use AI" according to C-level people, despite all the examples of things going wrong or at best "only" wasting lots of resources. Full hype mode.

LLMs are a tool. I just haven't figured out what they are a reliable and reasonable choice for.

For "How could problem X be addressed?", for instance with a language I know very well, it might generate an answer with good points for me to look up and verify in detail myself. Like a shortcut for a bunch of search queries.

bms48@mastodon.social

@jonny Ugh. I didn't realize at first which project this was. Then I looked at the repo. Yes, the road to hell IS paved with good intentions...

pandabutter@plush.city

@jonny It's a pattern I've been noticing all over.
Step 1: a process is created to measure something, like "does the software work right?" or "who do people want to be president?"
Step 2: There's an incentive for the people who perform and maintain the process to get a certain outcome, like good performance reviews or the guy you like being elected.
Step 3: Lacking the power to alter the thing being measured, the people in charge get creative with how they measure.

davey_cakes@mastodon.ie

@ainmosni @jonny

I'm kinda similar, the most I tried gambling was when I went to a gambling town for a wedding. I came out up but it just didn't do much for me.

Except poker, maybe because that involved people. That was fun.

pandabutter@plush.city

@jonny It's those darn management cybernetics all over again. In order to make good software, we built a system of writing and passing tests—and (say it with me) the purpose of a system is what it does. The starting assumption that "did we make good software" would always be a critical test turned out to be false, because no system was constructed to keep it true.

themipper@mastodon.social

@JetSetIlly @jonny damn.

bms48@mastodon.social

@jonny Reminded of the "similarity quagmire" by the changes the Netflix guys are staging for FreeBSD TCP logs. Going by my engineering notes from Feb, they did 80-90% of what I had in mind to do. The LLMs kept confabulating and conflating the meaning of "delayed_ack" with delacktime. "delayed_ack" is a variant; its meaning is context dependent for macOS or between BSDs. It is neither just a timer delta nor just an outstanding segment threshold counter; it can be all of those things. The LLMs did not infer this.

bms48@mastodon.social

@jonny The really bizarre thing is that even if LLMs were capable of actual reasoning, they'd be considered to be committing a category error in how they parse source code; logical fallacies then follow from that. Deep learning proponents also appear to have been doing this. The terminology describing LLMs and deep learning invites category error and cognitive dissonance. Geoffrey Hinton has a lot to answer for with his "It might be conscious" emergence-based woo.

synlogic4242@social.vivaldi.net

@jonny nailed it

jonny@neuromatch.social

@henryk
The smoke test for syntactic correctness is, thankfully, the interpreter.

europlus@social.europlus.zone

@bstacey @jonny and Angel Dust…don’t forget the Angel Dust.

davey_cakes@mastodon.ie

@ainmosni @jonny

Another thing it reminds me of, with tokens unlocking after x time so you'd better be ready to jump and start getting things done again, is my auntie's FarmVille addiction.

david_chisnall@infosec.exchange

@jonny @elebertus

I suspect testing has the same properties as translation. It’s moderately easy to build machine-translation systems that are kind-of okay. A mechanical dictionary is a reasonable approximation. If something goes through your post, looks up each word in an English-French dictionary (for example) and outputs the resulting text, it won’t be correct, but it will be vaguely comprehensible. If you build a dictionary of bigrams or trigrams (sequences of 2-3 words) this gets a bit better because now collocations are more likely to be translated correctly. It won’t be as good as a professional translator, but it will more or less look like the target language. Add more statistical modelling and you will get better up to a point. But there’s a cliff where you can’t improve without actually understanding the content. No amount of statistical modelling will let you accurately translate the things that are statistical outliers and the extrinsic knowledge necessary means that you can’t infer a correct translation from the text alone without understanding its context.
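A rough sketch of the mechanical-dictionary idea, in case it helps to see it. The two lookup tables and the function names are made up for illustration, not taken from any real translation system: the first function translates word by word, the second prefers a bigram entry when one matches.

```python
# Toy English-to-French tables. The entries are illustrative assumptions,
# far too small to be useful beyond this example.
UNIGRAMS = {
    "of": "de", "course": "cours",
    "the": "le", "cat": "chat", "sleeps": "dort",
}
BIGRAMS = {("of", "course"): "bien sûr"}


def translate_words(text):
    """Look up each word on its own; unknown words pass through unchanged."""
    return " ".join(UNIGRAMS.get(w, w) for w in text.lower().split())


def translate_bigrams(text):
    """Prefer a two-word entry when one matches, else fall back to one word."""
    words = text.lower().split()
    out = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in BIGRAMS:
            out.append(BIGRAMS[pair])
            i += 2
        else:
            out.append(UNIGRAMS.get(words[i], words[i]))
            i += 1
    return " ".join(out)


print(translate_words("of course the cat sleeps"))    # de cours le chat dort
print(translate_bigrams("of course the cat sleeps"))  # bien sûr le chat dort
```

The bigram table fixes the collocation, but only because someone already put that pair in the table. Neither function has anything to work with when the right rendering depends on context that is not in the text.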

Tests have a similar property. Good tests convey the intention, but the intention is not part of the code and so can’t be inferred from it. Good tests cover the things that the test author knows are corner cases, but these can’t be inferred from the code either (a few can, if the language has explicit error-handling constructs) because they’re a property of the input data.

In both cases, LLMs try to compensate for the lack of understanding by having a lot of examples of similar things in their input. If the thing you’re translating is similar to a load of other things, you may not need to understand it to translate it correctly because the first dozen (or hundred, thousand, or whatever scale you need) people to translate something like that did the hard work and you can reuse it. If the thing you’re testing is similar to a load of other things that already exist, someone else may have done the hard work of identifying the common failure modes and expressing intent.

But commonly LLM-generated tests end up testing that the code does what the code does. And that’s not useful. If you want that, just use fuzzing in a harness that tests trace equivalence between two versions of the program (for the same sequence of inputs, do they generate the same output?). That is useful for no-functionality-change-intended patches (typically things that improve performance or simplify unnecessary complexity), but most changes to the codebase are there because you want the behaviour to change. Good tests will fail if you changed something that was part of an API contract but will not fail if you added new behaviour, but tests based on the code will change.
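For what the trace-equivalence harness might look like, here is a minimal sketch. Everything in it is hypothetical (the running-maximum classes, `trace`, `find_divergence`); a real setup would more likely use an actual fuzzer than `random`. It feeds the same random input sequence to two versions and compares the sequence of outputs.

```python
import random


class OldRunningMax:
    """Original version: remembers every input and recomputes the maximum."""
    def __init__(self):
        self.seen = []

    def push(self, x):
        self.seen.append(x)
        return max(self.seen)


class NewRunningMax:
    """Performance patch, no functionality change intended."""
    def __init__(self):
        self.best = None

    def push(self, x):
        if self.best is None or x > self.best:
            self.best = x
        return self.best


class ClampedRunningMax:
    """Deliberate behaviour change: the result never drops below zero."""
    def __init__(self):
        self.best = 0

    def push(self, x):
        self.best = max(self.best, x)
        return self.best


def trace(make, inputs):
    """Output trace of a fresh instance fed the whole input sequence."""
    obj = make()
    return [obj.push(x) for x in inputs]


def find_divergence(make_old, make_new, trials=500, seed=0):
    """Return an input sequence on which the traces differ, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        inputs = [rng.randint(-100, 100) for _ in range(rng.randint(1, 20))]
        if trace(make_old, inputs) != trace(make_new, inputs):
            return inputs
    return None


print(find_divergence(OldRunningMax, NewRunningMax))      # None
print(find_divergence(OldRunningMax, ClampedRunningMax))  # a diverging input list
```

The harness is happy with the performance patch and flags the clamped version, even though in this sketch the clamping is the intended change. It has no way of knowing which differences were wanted.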

This isn’t limited to LLMs. Some of the LLVM tests are just ‘run this command, the output should look like this’. People typically reject these in review now because long and painful experience showed us that it was hard to refactor when a change broke a test and the change author couldn’t tell if the difference in output came from something we actually cared about or just something that happened to be part of the old version’s output. But humans can, at least, tell the difference in the tests because they understand what it is that they intend with the change that introduces the test.
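A small sketch of the difference, with made-up names throughout (this is not how LLVM's tests are written, just the same shape in Python). `snapshot_test` compares the whole output against a stored string; `intent_test` checks only the property the author cared about, assumed here to be "each item appears in alphabetical order with its total".

```python
def render_v1(totals):
    """Original formatter."""
    lines = ["Report"]
    for name in sorted(totals):
        lines.append(f"{name}: {totals[name]}")
    return "\n".join(lines)


def render_v2(totals):
    """Cosmetic refactor: new header and column alignment, same data."""
    lines = ["== Report =="]
    for name in sorted(totals):
        lines.append(f"{name:<6}{totals[name]:>4}")
    return "\n".join(lines)


SAMPLE = {"pears": 12, "apples": 3}
GOLDEN = "Report\napples: 3\npears: 12"  # captured from render_v1


def snapshot_test(render):
    """'Run this, the output should look like this.'"""
    return render(SAMPLE) == GOLDEN


def parse_row(line):
    """Pull (name, total) out of a row, ignoring punctuation and spacing."""
    fields = line.replace(":", " ").split()
    return fields[0], int(fields[-1])


def intent_test(render):
    """Check only what matters: items in alphabetical order with their totals."""
    rows = render(SAMPLE).splitlines()[1:]  # skip the header line
    return [parse_row(r) for r in rows] == [("apples", 3), ("pears", 12)]


print(snapshot_test(render_v1), snapshot_test(render_v2))  # True False
print(intent_test(render_v1), intent_test(render_v2))      # True True
```

When the snapshot test fails on `render_v2`, nothing in the test says whether the header, the spacing, or the numbers were the part anyone cared about. The intent test encodes that decision, which is the bit a human has to supply.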

jonny@neuromatch.social

@davey_cakes
@ainmosni
Exactly the same, I recognize that feeling too

jonny@neuromatch.social

@david_chisnall
@elebertus
Well said

gklyne@indieweb.social

@jonny “to an LLM something saying it does something is the same thing as it actually doing that thing” bears pondering

mk30@tilde.zone

@jonny I agree that it's gambling addiction.

jdp23@neuromatch.social

great analysis! @david_chisnall @jonny @elebertus

bms48@mastodon.social

@poleguy @jonny rsync is the fallback method for zxfer, a FreeBSD script-based tool for replicating ZFS datasets over the network; normally it calls out to zfs send/recv to do the hard work of shuffling deltas. I don't use it myself; I'm currently using a 3rd party backup agent with SFTP backing for Win11 clients but not rsync itself.

bms48@mastodon.social

@jonny I suspect the "information gap" you mention here corresponds to known LLM kryptonite: they can't emulate human reasoning in the abductive mode (to best outcome or expectation, usually inherently non-boolean, and the domain of GOFAI expert systems). Human decisions about refactoring legacy code bases (Chesterton's Fence) might constitute an example of problem solving requiring abductive reasoning. I published https://burdentennis.com yesterday but didn't put my name to it directly.