FARVEL BIG TECH

all the criticism has been said, all the takes been had.

110 posts, 60 posters, 386 views
  • fluffy@plush.city

    @jonny also why the hell would they write tests for a C program/library in Python? It makes no sense.

    fluffy@plush.city wrote (#69):

    @jonny ... and why the everloving FUCK do these tests run as root

    • jonny@neuromatch.social

      I think the modal situation here is that the people are reading none or very little of what is being generated by the LLM, so the tests have a special role: Tests function as the pull arm on the slot machine, you just generate until tests pass, and that's a jackpot. Obviously that's meaningless when the tests are meaningless, so tests take on a very different meaning and role in slot machine coding.

      Previously we would write careful test conditions that were based off some real problem or an understanding of what the code under test did, and had a specific thing they were intended to protect against. Tests move slow and are designed to protect us against the things we know can go wrong. When we learn of a new wrong thing, we add a test.

      LLM tests have the form of tests but don't do the same thing. They often test nothing, and are just expressions of truisms that the probabilistic text space explored while generating. They have strongly worded names but end up actually asserting that basic language features work as expected. Because it is not us writing tests for ourselves, where we only harm ourselves by making them weak, they function instead as a passively obfuscated justification for the code that the LLM generates. The user wants the tests to pass. The LLM provides.

      The tests are theater: they are the play field for the slot machine. They are mild, surmountable, need to fail a few times to be plausible, but must eventually pass within the expected generation loop window to deliver the payout.

      dahukanna@mastodon.social wrote (#70):

      @jonny Referencing
      1. @shauna's post, based on @DGI, about power dynamics and dysfunction between imaginary labour (iML) and interpretive labour (iNL): https://www.rethinkingpower.info/how-interpretive-labor-straddles-the-gap-between-rules-and-reality/
      2. "Power", chapter 4 of Mary Parker Follett's Dynamic Administration: https://mastodon.social/@dahukanna/110643444784446704

      Presuming Productivity (P) = iML / iNL,
      for a dysfunctional power-over tool imposition (e.g. LLM, factory production, etc.):
      - Imagined abstract: P = 1 LLM PR / 0 review units = ∞
      - Interpreted reality: P = 1 LLM PR / >10 review units < 0.1
      - https://mastodon.social/@dahukanna/113230734549577353

        technocrow@blahaj.zone wrote (#71):

        @fluffy@plush.city @jonny@neuromatch.social running tests as root is fucking wild

        • jens@social.finkhaeuser.de

          @jonny related:

          https://finkhaeuser.de/2026-04-10-outsourcing-thought-is-going-great/

          sesamzoo@mastodon.social wrote (#72):

          @jens, great article, thank you. Did you pull the lever "just one more time" and if so, did it get even worse?

          @jonny, thank you for this thread and lots of your other threads on the topic.

          Both help me feel that I'm not the ghost driver, although these days there is a lot of contraflow in my lane. Mostly at work, where the AI fanboys/believers/addicts are at least way louder than the people trying to understand their code and keep it in maintainable shape.

            jens@social.finkhaeuser.de wrote (#73):

            @sesamzoo @jonny No, I did not. I try not to use LLMs at all, so I really did maybe one or two queries more than described, just to get a feel for it.

              jens@social.finkhaeuser.de wrote (#74):

              @sesamzoo @jonny Also, I feel dirty every time, so I don't want to waste water crying in the shower afterwards.

              • themipper@mastodon.social

                @jonny this whole thing is so bad that the only viable way seems to be to fork it before the LLM sloppening. It is a shame to see more and more foundational projects fall into the LLM trap.

                And as always you hit the nail on the head with your deep dive and explanations. I love reading them.

                I will use your observation on how, for an LLM, what is written is the same as what is happening.

                jetsetilly@mastodon.gamedev.place wrote (#75):

                @themipper @jonny
                > It is a shame to see more and more foundational projects fall into the LLM trap

                The one that breaks my heart is vim.

                • jonny@neuromatch.social

                  Here's an example from some code that was thrust at me this week. The rest of the tests try a bit harder to look like tests, but this one is perplexing.

                  What does it test? The function name suggests it's a smoke test. LLMs love to call things smoke tests. That would suggest this would be an early-run test that fails loudly if some basic precondition - like having ffmpeg - fails. Or, I guess we are smoke testing the ensure_ffmpeg function? Anyway, who knows. However, we first check if ffmpeg or ffprobe are present, which is exactly what ensure_ffmpeg does. If they aren't present, a warning tells us that ffmpeg/ffprobe are required for the video tests, which makes it seem like this should be a parameterizing test that controls which tests are run, which of course it does not do.

                  So the test literally does nothing and cannot possibly fail, but says it does at least two things, because to an LLM something saying it does something is the same thing as it actually doing that thing.
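The test being discussed is not reproduced in the thread, so what follows is only a hypothetical reconstruction of the pattern described, next to one way a missing-dependency check could actually gate other tests. `ensure_ffmpeg`'s body, `test_ffmpeg_smoke_vacuous`, `HAVE_FFMPEG` and `VideoTests` are all assumed names, and the sketch uses the standard library's `unittest` so that it runs without extra dependencies.

```python
import shutil
import unittest
import warnings


def ensure_ffmpeg() -> bool:
    """Hypothetical stand-in: True only if ffmpeg and ffprobe are both on PATH."""
    return all(shutil.which(tool) is not None for tool in ("ffmpeg", "ffprobe"))


def test_ffmpeg_smoke_vacuous() -> None:
    """The pattern described above: no assertion, so no way to fail."""
    if not ensure_ffmpeg():
        warnings.warn("ffmpeg/ffprobe are required for the video tests")
    # Both branches fall through, so this "passes" with or without ffmpeg.


# Evaluate the precondition once and let it decide which tests run.
HAVE_FFMPEG = ensure_ffmpeg()


@unittest.skipUnless(HAVE_FFMPEG, "ffmpeg/ffprobe not found on PATH")
class VideoTests(unittest.TestCase):
    """Tests that need ffmpeg are reported as skipped when it is missing."""

    def test_ffprobe_is_resolvable(self) -> None:
        self.assertIsNotNone(shutil.which("ffprobe"))
```

The difference is visible in the report: the first function shows up as a pass in every environment, while the second shows up as an explicit skip when the tools are absent.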

                  henryk@chaos.social wrote (#76):

                  @jonny (Un)charitable interpretation: it smoke tests whether the ensure_ffmpeg function is syntactically correct — which is a failure mode LLMs are actually concerned about.

                    jonny@neuromatch.social wrote (#77):

                    @fluffy
                    Apparently all the tests for rsync are integration tests across bash rsync calls

                    • ainmosni@social.ainmosni.eu

                      @KalenXI @jonny if you don't want to get lazy when using AI, is there really a use for it at all? I mean, it's been proven that reviewing code is much more difficult than writing it, so I'm finding it much more taxing to review slopcode than if I'd just write it myself.

                      Of course, that's adding the usual disclaimer that all this is not even relevant until the ethical and environmental shitshow of AI has been fixed.

                      robinadams@mathstodon.xyz wrote (#78):

                      @ainmosni @KalenXI @jonny With the caveat that the ethical problems with AI mean it's absolutely not worth the cost:

                      It looks like AIs are actually getting better at finding bugs and security holes.

                      So do your usual testing and code reviews, and then ask Claude to find any bugs you may have missed. It will give you some false positives but also some true ones.

                      Very different from having an LLM generate the code and a human try to fix it up.

                      Like Cory Doctorow's example: using an AI to give a second opinion on MRI scans (a centaur) makes scans more expensive but higher quality.

                      Having an AI analyse the scans at high speed and then getting some poor schmuck to try to spot its mistakes (reverse centaur) makes scans cheaper and lower quality, but at least there's a person with little power in the hierarchy who gets the blame for the problems.

                      Guess which one the people pouring trillions into AI want?

                        sesamzoo@mastodon.social wrote (#79):

                        @jens @jonny At work "everyone has to use AI" according to C level people, despite all examples of things going wrong or at best "only" wasting lots of resources. Full hype mode.

                        LLMs are a tool. I just haven't figured out what they are a reliable and reasonable choice for.

                        For "How could problem X be addressed?", e.g. with a language I know very well, it might generate an answer with good points for me to look up and verify by myself in detail. Like a shortcut for a bunch of search queries.

                        • jonny@neuromatch.social

                          So, look. One shot rewriting the whole test suite in another language is probably not great to do, but what happened here is so much worse than you are expecting.

                          https://github.com/RsyncProject/rsync/pull/903/

                          This does not "translate tests into pytest" or a unit testing framework, it writes its own testing framework where tests are whole python scripts that redefine basic test functions in every script. Surely there would be a single way to "run rsync and get the results" - nope, well, there is, but then every test file will randomly redefine its own _run_and_capture function. So like now rsync needs a test suite for its test suite.

                          If, instead of telling an LLM to "rewrite the tests in python", you just searched "python testing", you would find the pytest docs. And then you would find examples. And then you could write fixtures to deduplicate all the prior shell script setup and teardown stuff, and so on. But since it was just "rewrite the tests in python", it's now worse than before, and the odds of the rewrite actually being a 100% faithful translation are close to 0.
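As a rough illustration of the deduplication point, here is a minimal sketch with one shared "run and capture" helper and one shared setup/teardown, rather than a copy in every test file. This is not the code from the PR: `run_and_capture`, `TreeTestCase` and `CopyTest` are hypothetical names, and it uses the standard library's `unittest` instead of pytest so it runs with no third-party packages. In pytest the shared setup would be a fixture, for example one built on the built-in `tmp_path` fixture. A small Python one-liner stands in for the binary under test so the sketch runs anywhere.

```python
import subprocess
import sys
import tempfile
import unittest
from pathlib import Path


def run_and_capture(args: list[str]) -> subprocess.CompletedProcess:
    """The single shared helper: run a command, capture stdout/stderr as text."""
    return subprocess.run(args, capture_output=True, text=True, check=False)


class TreeTestCase(unittest.TestCase):
    """Shared setup/teardown: each test gets fresh src/ and dst/ directories."""

    def setUp(self) -> None:
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)  # teardown, even if the test fails
        root = Path(self._tmp.name)
        self.src = root / "src"
        self.dst = root / "dst"
        self.src.mkdir()
        self.dst.mkdir()
        (self.src / "hello.txt").write_text("hello\n")


class CopyTest(TreeTestCase):
    def test_copy_preserves_content(self) -> None:
        # Stand-in command (assumption): a real suite would invoke the
        # program under test here instead of this one-line copy script.
        script = "import shutil, sys; shutil.copy(sys.argv[1], sys.argv[2])"
        result = run_and_capture(
            [sys.executable, "-c", script,
             str(self.src / "hello.txt"), str(self.dst / "hello.txt")]
        )
        self.assertEqual(result.returncode, 0, result.stderr)
        self.assertEqual((self.dst / "hello.txt").read_text(), "hello\n")
```

Every new test file then imports the helper and subclasses the base case, so there is exactly one definition of each to review.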

                          bms48@mastodon.social wrote (#80):

                          @jonny Ugh. I didn't realize at first which project this was. Then I looked at the repo. Yes, the road to hell IS paved with good intentions...

                          • jonny@neuromatch.social

                            To a person, the whole purpose of the test is for it to fail when it should. That's an elemental part of writing good tests: they must fail before the patch, or else they provide no protection. We want protection from failure, that is good for us. We need tests to protect us because we can't possibly evaluate all the other parts of a complex system when we try to fix one part of it.

                            LLM slot machines change what tests mean - of course we still want the code to work good, but if we're not evaluating the code or the tests, then what the slot machine turns them into is just a high score and the jackpot condition. 130 new tests added, that means it's good. They pass, that means I win.

                            The bugfix loop with LLMs defeats the purpose of automated tests and renders it no better than manual testing: you notice a bug, you yell at the LLM to fix it, you keep looking at the specific thing that's broken until it's fixed, good robot, ship it. The changes don't have meaningful tests, and nothing else does either, so the slot machine loop repeats, bug->fix->win. Very velocity. Rocket fuel even.
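The "must fail before the patch" property can be checked mechanically. The sketch below is purely illustrative: `parse_port_buggy`, `parse_port_fixed`, both tests and the `fails` helper are hypothetical and do not come from any project mentioned in the thread. It runs the same test against a pre-patch and a post-patch implementation and reports whether the test can tell them apart.

```python
def parse_port_buggy(text: str) -> int:
    """Hypothetical pre-patch code: accepts out-of-range ports."""
    return int(text)


def parse_port_fixed(text: str) -> int:
    """Hypothetical post-patch code: rejects ports outside 1..65535."""
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def regression_test(parse_port) -> None:
    """Targets the specific bug: an out-of-range port must be rejected."""
    try:
        parse_port("70000")
    except ValueError:
        return
    raise AssertionError("out-of-range port was accepted")


def vacuous_test(parse_port) -> None:
    """Plausible-looking, but only exercises the path that never broke."""
    assert parse_port("8080") == 8080


def fails(test, impl) -> bool:
    """True if the test raises AssertionError against the implementation."""
    try:
        test(impl)
    except AssertionError:
        return True
    return False
```

Here `fails(regression_test, parse_port_buggy)` is True and `fails(regression_test, parse_port_fixed)` is False, so that test protects against the bug. `fails(vacuous_test, ...)` is False for both implementations, which means it passes before and after the patch and protects against nothing.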

                            pandabutter@plush.city wrote (#81):

                            @jonny It's a pattern I've been noticing all over.
                            Step 1: a process is created to measure something, like "does the software work right?" or "who do people want to be president?"
                            Step 2: There's an incentive for the people who perform and maintain the process to get a certain outcome, like good performance reviews or the guy you like being elected.
                            Step 3: Lacking the power to alter the thing being measured, the people in charge get creative with how they measure.

                            • ainmosni@social.ainmosni.eu

                              @jonny Gambling doesn't attract me at all, never has, wonder if that's why I've been more resistant to it than others...

                              davey_cakes@mastodon.ie wrote (#82):

                              @ainmosni @jonny

                              I'm kinda similar, the most I tried gambling was when I went to a gambling town for a wedding. I came out up but it just didn't do much for me.

                              Except poker, maybe because that involved people. That was fun.

                                pandabutter@plush.city wrote (#83):

                                @jonny It's those darn management cybernetics all over again. In order to make good software, we built a system of writing and passing tests—and (say it with me) the purpose of a system is what it does. The starting assumption that "did we make good software" would always be a critical test turned out to be false, because no system was constructed to keep it true.

                                  themipper@mastodon.social wrote (#84):

                                  @JetSetIlly @jonny damn.

                                  • bms48@mastodon.social

                                    @jonny I have seen this pattern of which you speak when attempting to use LLMs to compare TCP Delayed-ACK implementations between BSD derived code bases. They generated output suggesting semantics that just weren't there, presumably based on how similarly named things were between each fork, but this was not obvious without reading the source for oneself in context. This went doubly for FreeBSD where there are multiple TCP functional blocks ("stacks").

                                    bms48@mastodon.social wrote (#85):

                                    @jonny I'm reminded of the "similarity quagmire" by the changes the Netflix guys are staging for FreeBSD TCP logs. Going by my engineering notes from Feb, they did 80-90% of what I had in mind to do. The LLMs kept confabulating and conflating the meaning of "delayed_ack" with delacktime. "delayed_ack" is a variant; its meaning is context dependent for macOS or between BSDs. It is neither a timer delta, nor is it an outstanding segment threshold counter; it can be all of those things. The LLMs did not infer this.

                                      bms48@mastodon.social wrote (#86):

                                      @jonny The really bizarre thing is, that even if LLMs were capable of actual reasoning, they'd be considered to be committing category error in how they parse source code; logical fallacies then follow from that. Deep learning proponents also appear to have been doing this. The terminology describing LLMs and deep learning invites category error and cognitive dissonance. Geoffrey Hinton has a lot to answer for with his "It might be conscious" emergence-based woo.

                                        synlogic4242@social.vivaldi.net wrote (#87):

                                        @jonny nailed it

                                          jonny@neuromatch.social wrote (#88):

                                          @henryk
                                          The smoke test for syntactic correctness is, thankfully, the interpreter.
