👀 … https://sfconservancy.org/blog/2026/apr/15/eternal-november-generative-ai-llm/ …my colleague Denver Gingerich writes: newcomers' extensive reliance on LLM-backed generative AI is comparable to the Eternal September onslaught to USENET in 1993.

redchrision@mastodon.social

Something in their policies like: you can use LLM-generated code however you please, but consider that it is trained on X, Y, Z, and it might not follow the policies of where you use it, and you are using it on your own responsibility, might help them out. But it is still on them if they train models on things they should not, and the LLMs go on to generate code that is questionable from a policy perspective.

sfoskett@techfieldday.net

@evan @cwebber @bkuhn @ossguy @richardfontana Ok I haven’t really heard people before you guys explain that to me. So I was wondering if it was possible that it couldn’t be licensed. Thanks.

evan@cosocial.ca

@richardfontana @cwebber @bkuhn @ossguy Yeah, I thought my job couldn't be automated, either, and yet here we are.

evan@cosocial.ca

@richardfontana @cwebber @bkuhn @ossguy Seriously, though, a lot of the work seems like it is tractable to LLM automation?

Like, the abstraction part seems like it's just summarizing components at the function, module, and program level. This is the command-line argument parser, this is the database abstraction layer, this is the logging module. LLMs are pretty good at this!

richardfontana@mastodon.social

@evan oh I mean of course you could use LLMs to help with the analysis @cwebber @bkuhn @ossguy

wwahammy@social.treehouse.systems

@ossguy @cwebber @LordCaramac @bkuhn @richardfontana proprietary software companies extensively use GitHub and yet SFC's position is "don't use GitHub".

There are so many things we do in free software and in the interactions with SFC and FSF that would be simpler if we used proprietary software. How many janky experiences have people been asked to tolerate to participate? Why shouldn't we use proprietary software there?

bkuhn@fedi.copyleft.org

@novalis
I agree with your supporting arguments but not the conclusion.

It goes back to the mutually assured destruction idea: no one in the for-profit proprietary software industry is going to bring a lawsuit, because they are so invested in LLM-backed AI succeeding.

That's where our commons differs widely from other creative works of expression.

I am worried about compulsory licensing for *training* —could be a disaster — but unrelated to output.

@zacchiro @cwebber @ossguy @richardfontana

richardfontana@mastodon.social

@sfoskett neither scotus nor afaik any other US court has held this. I would argue that it seems to be the direction the US legal system is going in, but I recently heard a persuasive counterargument from a well regarded US FOSS lawyer @evan @cwebber @bkuhn @ossguy

evan@cosocial.ca

For filtration, it seems like merger or scènes à faire would also be kind of automatable, maybe with human oversight. Is there a way to make a mailing daemon without a logging module? Maybe, but it's so common that everyone does it that way. Could you have a Person class without a getter and setter for the name? Probably not?

@richardfontana @cwebber @bkuhn @ossguy
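A toy sketch of what a first, mechanical filtration pass could look like, assuming a hand-written list of "stock" element names. The `COMMON_ELEMENTS` set and the `filter_common` helper are hypothetical, invented for illustration. Matching names is only a stand-in for the legal judgment involved, so anything it filters out would still need human review.

```python
# Hypothetical list of element names treated as stock or dictated by the task.
# A real list would need legal and domain judgment, not a hard-coded set.
COMMON_ELEMENTS = {"logging", "get_name", "set_name", "__init__", "main"}


def filter_common(elements):
    """Split element names into (kept, filtered_out), preserving input order."""
    kept = [e for e in elements if e not in COMMON_ELEMENTS]
    filtered_out = [e for e in elements if e in COMMON_ELEMENTS]
    return kept, filtered_out


# Example: elements abstracted from an imaginary mailing daemon.
kept, filtered_out = filter_common(
    ["logging", "get_name", "set_name", "retry_queue", "dkim_signer"]
)
print("kept for comparison:", kept)
print("filtered as commonplace:", filtered_out)
```

Only the `kept` list would go forward to the comparison step.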

evan@cosocial.ca

The comparison seems tough, but I'd put an LLM to the task. "How similar are the database abstraction layers in activitypub-bot and Fedify?" Again, I'd probably want some human review, but for that code stuff LLMs are pretty good.

@richardfontana @cwebber @bkuhn @ossguy

sfoskett@techfieldday.net

@richardfontana @evan @cwebber @bkuhn @ossguy I feel like it’s 3 questions for the court:
1 Can a non-human actor produce a copyrightable work? Likely no.
2 Is the human prompt and review enough to apply copyright to LLM content? Maybe?
3 Does this have implications for open source? I guess not.

evan@cosocial.ca

I consider myself an expert on this process since I learned about it 45 minutes ago, but it seems like AFC follows the hierarchical layers of modern programming-in-the-large -- statements, functions, modules, packages, program. That is the stuff that LLMs handle pretty well.

@richardfontana @cwebber @bkuhn @ossguy
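To make the "hierarchical layers" idea concrete, here is a minimal sketch that uses Python's standard `ast` module to outline one source file at the module, class, and function levels. The `abstraction_outline` helper and the sample source are hypothetical. This only lists names; it does not do the summarizing an LLM or a human reviewer would add at each level.

```python
import ast


def abstraction_outline(source: str, module_name: str = "example") -> dict:
    """Outline one module: its top-level functions and each class's methods."""
    tree = ast.parse(source)
    outline = {"module": module_name, "functions": [], "classes": {}}
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:
        if isinstance(node, func_types):
            outline["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            # Record only the methods defined directly in the class body.
            outline["classes"][node.name] = [
                item.name for item in node.body if isinstance(item, func_types)
            ]
    return outline


# Hypothetical sample module, used only for illustration.
SAMPLE = '''
import logging

def parse_args(argv):
    return argv[1:]

class Database:
    def connect(self):
        pass

    def query(self, sql):
        pass
'''

outline = abstraction_outline(SAMPLE, "app")
print(outline)
```

Running the same outline over every file in a package would give the package-level and program-level layers.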

bkuhn@fedi.copyleft.org

@evan

I actually think that these copyright concepts aren't particularly automatable.
Even if we try, it's pure arms race.

And the merger doctrine isn't the big problem here. The hard part is the more complex analysis in cases where merger doctrine clearly doesn't apply, and I suspect that analysis is difficult to (even partially) automate.

But I'm looking into it.

Cf: chardet situation https://github.com/chardet/chardet/issues/355#issuecomment-4145369025

@richardfontana @cwebber @ossguy

? Offline

@bkuhn @zacchiro @cwebber @ossguy @richardfontana I don't even know if I agree with my supporting arguments. But I don't even think that it has to be someone in the proprietary world that brings a lawsuit -- it could be anyone whose code or text is trained on.

cwebber@social.coop

@bkuhn @evan @richardfontana @ossguy One thing I worry about is that the chardet rewrite might not generalize. The chardet maintainer used *more* care in the rewrite than most projects which have followed suit for laundering would. https://dan-blanchard.github.io/blog/chardet-rewrite-controversy/

Even then, it raises questions, because even the maintainer admits chardet was part of the training set.

It's very similar to how a friend recently sent me, "Claude managed to reverse engineer Bubble Bobble without using any reverse engineering tools, just inspecting the binary!" https://kotrotsos.medium.com/we-pointed-an-ai-at-raw-binary-files-from-1986-662ba30120f3

Which like, Claude is enough of a black box already but Bubble Bobble is also one of the most studied ROMs in history, so that's hard to evaluate whether it's true. You'd have to choose a less studied ROM as a test case, not Bubble Bobble, which the internet has discussed to death.

cwebber@social.coop

@bkuhn @evan @richardfontana @ossguy Probably a ton of people here think I am anti-AI-output, and that I would be upset to find out that the chardet rewrite was legal.

Actually, I'm not! I'd be fine with the ability to copyright launder software to some degree, as long as we could do the same for proprietary software (including in binary form).

I'm concerned about whether or not we have an *equitable* situation, though. And I'm *more concerned* that we need to advise people, who are incorporating code *today*.

evan@cosocial.ca

@bkuhn I just did an abstraction and filtration pass on a medium-sized application framework (~30K LOC), and as an expert on the code I think it did a good job:

https://claude.ai/share/071ccb69-5d22-4673-905a-362d9663e7d0

It missed a few things (e.g. relay specs). Then again, I have no idea how this kind of review is supposed to work. I didn't go down to the function or statement level -- that'd probably be much noisier.

Maybe chardet 2 and 7 would be a better test of the technique?

@richardfontana @cwebber @ossguy

richardjacton@fosstodon.org

@cwebber @bkuhn @ossguy @richardfontana I don't see a great way out of the copyright-stripping conclusions for them without changes to the law. As I understand their defense for training on copyrighted materials, it's predicated on the models being "transformative" and not competing directly with the original works in the market. The models themselves don't compete with the training material, only their outputs do, and the LLM companies want any liability for that to be on users, not them.

richardjacton@fosstodon.org

@cwebber @bkuhn @ossguy @richardfontana Under this view it doesn't matter how the training data was licensed, as it's a fair use defense. The outputs being uncopyrightable / effectively public domain allows people to claim they wrote it when it's convenient and they want to be able to copyright it, as it's hard to prove whether it was AI-generated or human-authored. And simultaneously to claim that it was the output of an LLM when they want to strip inconvenient licensing terms.

evan@cosocial.ca

If I were going to productize this, I'd do AF passes on a huge training dataset like The Stack and generate some kind of fingerprint for each program. (Estimated cost: billions!)

https://huggingface.co/datasets/bigcode/the-stack

Then, I'd have a tool to let you fingerprint your own code and C it against the big database -- maybe give you a list of high-similarity codebases.

And you could re-run the comparison each time you push to Git -- maybe only Cing what changed.

@bkuhn @richardfontana @cwebber @ossguy
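A rough sketch of the fingerprint-and-compare step, with a big caveat: it swaps in plain token shingles and Jaccard similarity for whatever fingerprint an actual abstraction-and-filtration pass would produce, so it only catches near-literal overlap. The function names, the tiny in-memory `database`, and the 0.5 threshold are all assumptions made for illustration.

```python
import re


def fingerprint(source: str, k: int = 3) -> frozenset:
    """Return the set of k-token shingles found in the source text."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return frozenset(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_matches(source: str, database: dict, threshold: float = 0.5) -> list:
    """List (name, score) pairs at or above the threshold, highest first."""
    fp = fingerprint(source)
    scores = [(name, similarity(fp, other)) for name, other in database.items()]
    hits = [pair for pair in scores if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in for a large fingerprint database.
database = {
    "proj_a": fingerprint("def add(a, b):\n    return a + b\n"),
    "proj_b": fingerprint("class Logger:\n    def log(self, msg):\n        print(msg)\n"),
}

matches = rank_matches("def add(a, b):\n    return a + b\n", database)
print(matches)
```

For the re-run-on-push idea, one option would be to fingerprint only the files reported by `git diff --name-only` between the old and new commits, rather than the whole tree.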