Is there a way to build #consent/legal terms into the #ActivityPub protocol, or is it already there?
-
@madsenandersc all clients on the Fediverse should hopefully use the ActivityPub protocol, and if a client happens to be a bot or AI scraper, I would expect it to read the field that tells it LLM training and/or personal profiling is not allowed. Hopefully we can also afford enforcement: detecting whether an LLM suddenly contains data that was labeled to forbid this.
Even better if data protection laws like the GDPR provided this protection by default, though.
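For illustration only, a post could carry a hypothetical extension property, something like the sketch below. The `trainingConsent` name, its context URL, and its value are all invented here; no such field is standardized in ActivityPub today.

```python
# Hypothetical sketch: an ActivityPub Note carrying an invented
# "trainingConsent" extension property. Nothing like this is
# standardized; the name, context URL, and value are assumptions.
note = {
    "@context": [
        "https://www.w3.org/ns/activitystreams",
        {"trainingConsent": "https://example.org/ns#trainingConsent"},
    ],
    "type": "Note",
    "content": "Hello Fediverse!",
    "trainingConsent": "deny-llm-training",
}

def may_train_on(obj: dict) -> bool:
    """What a well-behaved scraper would check before ingesting a post."""
    return obj.get("trainingConsent") != "deny-llm-training"
```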
-
Again, it's a protocol, not an application. Think about it: it is not the protocol that exposes the content to the AI scraper - it is the application.
Spam is not prevented by legal restrictions in the SMTP protocol either - it's a feature of the application instead, and it is usually activated by the user (or at least possible to deactivate).
I understand what you want, and I get the reasons for it, but restricting what protocols can carry and for whom is usually a bad idea in the long run.
There is always a use case that you did not think of, and suddenly people start forking or modifying the protocol, with backwards compatibility going down the drain.
-
@madsenandersc I can also write a scraper that doesn't care about robots.txt. People would then discover that the scraper acts in ways we don't like and try to block it.
I think the same applies if you think through the implications of ActivityPub having a distinct field for how the data is allowed to be used, whether that means legal action or blocking certain things.
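To make that concrete: honoring robots.txt is a purely voluntary check inside the scraper's own code, roughly like the sketch below (the instance URL and bot name are placeholders). A rogue scraper simply never makes the call.

```python
from urllib.robotparser import RobotFileParser

# A polite scraper checks robots.txt before fetching; nothing in the
# protocol forces it to. Instance URL and bot name are placeholders.
robots = RobotFileParser()
robots.set_url("https://example.social/robots.txt")
robots.read()

url = "https://example.social/@alice/12345"
if robots.can_fetch("ExampleAIBot", url):
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows", url)  # a rogue scraper ignores this entirely
```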
-
@madsenandersc It seems like we're just missing something, because we don't even have a basic way to say "no, my post isn't allowed for LLM training".
There's also the good old "Do Not Track" HTTP header.
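(For reference, "Do Not Track" is literally one request header, `DNT: 1`; a client can send it, but nothing obliges the server to honor it. A minimal stdlib sketch, with a placeholder URL:)

```python
import urllib.request

# "Do Not Track" is a single request header; whether the server
# honors it is entirely up to the server. The URL is a placeholder.
req = urllib.request.Request(
    "https://example.social/@alice/12345",
    headers={"DNT": "1", "User-Agent": "example-client/0.1"},
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()  # the server may track this request regardless
```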
-
@madsenandersc If we just conclude that AI scrapers are bad actors and won't respect anything, there isn't much to do... but if we at least put up a boundary, then they can't claim we didn't say so. Otherwise it's just a free pass to scrape the Fediverse and use the data for whatever purpose...
-
I don't follow the reasoning behind your robots.txt example here.
If the field in the ActivityPub protocol cannot be enforced (just like an entry in robots.txt), why bother? It will just be like the "Do Not Track" setting in your browser, giving those who don't know any better a false sense of security. Or do I misunderstand your example?
I still firmly believe that this problem can only be solved reliably at the application level, not at the protocol level.
-
I'll be the devil's advocate here. If the protocol forbids scraping, who is to blame in a legal fight if scraping occurs? The application provider? Because the scraper never touched anything related to the protocol and never saw whether the flag was set on the post.
Is the flag to be set at all times? Can you prove that the content was transported by the ActivityPub protocol, and that the scraper was definitely aware of that? Because if you can't, legal action is not possible anyway.
This is not about what is right and what is desirable; it's about what is possible to prove in a courtroom and who you can blame for any misdeed. I am pretty sure that a good lawyer will be able to shift the blame to the application displaying the content to the scraper, or at least make the case that the scraper had no way of knowing that this particular content was off-limits.
Yes, you can implement an extension of the protocol with the field you talk about, but it will be of no use at all unless you get all possible Fediverse clients to agree on that field and implement precautions against scraping when the user has set it. If the scraper is successful anyway, the Fediverse client will be the one to blame for not implementing the precautions well enough.
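Such a precaution might amount to something like this application-level sketch (the bot list, the consent field, and the handler shape are all invented for illustration; GPTBot and CCBot are merely examples of crawlers that self-identify):

```python
# Sketch of an application-level precaution: refuse to serve flagged
# posts to clients identifying as known AI crawlers. The bot list,
# the "trainingConsent" field, and the handler shape are assumptions.
KNOWN_AI_SCRAPERS = {"GPTBot", "CCBot"}

def serve_post(post: dict, user_agent: str) -> tuple[int, str]:
    flagged = post.get("trainingConsent") == "deny-llm-training"
    is_scraper = any(bot in user_agent for bot in KNOWN_AI_SCRAPERS)
    if flagged and is_scraper:
        return 403, "Content not available for LLM training."
    return 200, post["content"]
```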
-
@madsenandersc I think there's a genuine chance of telling when a post shows up in an LLM and that post carried a field saying "don't train LLMs on this post".
The rest of the enforcement work is about figuring out whether the scraper could have seen that field and chose to ignore it.
Even if posts can be exposed through applications that filter out the field, it may still be possible to prove that an LLM (or ad profiling) used data from the source, and hence had access to the field.
-
@madsenandersc The other part of my reasoning is: if we don't declare anything, then we've lost from the beginning. We SHOULD find a way to declare consent to data sharing with respect to the ActivityPub protocol.
Otherwise we have NOTHING. Like, literally nothing? Do you know of any kind of existing restriction?
-
@madsenandersc I think it's both possible and likely that there will be ActivityPub applications relaying data without the intended consent field. Of course those exist, for both good and bad reasons.
But if you have not said or stated anything, it is terribly unclear what YOUR intention/consent is. So I would basically like a way to say that MY post isn't intended for LLM training.
That's just the beginning.
-
@madsenandersc What happens after that can be *a lot*. I'm sure that *a lot* of people would agree to let their applications echo this consent.
Through logging, you can tell whether an AI scraper visited endpoints that carried this field and ignored it. That's possible.
And it also becomes possible to push for litigation or legislation on the basis of a provable interest.
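A sketch of that logging idea (the log line format, the bot names, and the set of flagged post paths are all assumptions for illustration):

```python
# Sketch: flag log lines where a self-identifying AI bot fetched a
# post that was served with the consent field set. The log format,
# bot names, and the flagged-path set are assumptions.
KNOWN_AI_SCRAPERS = {"GPTBot", "CCBot"}
FLAGGED_PATHS = {"/@alice/12345", "/@alice/12346"}

def scan_access_log(path: str) -> None:
    with open(path) as log:
        for line in log:
            # assumed line format: <ip> <request-path> <user-agent...>
            ip, req_path, agent = line.rstrip("\n").split(" ", 2)
            if req_path in FLAGGED_PATHS and any(b in agent for b in KNOWN_AI_SCRAPERS):
                print(f"{ip} fetched flagged post {req_path} as {agent}")
```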
-
Ah - I think I see where we fundamentally disagree about all of this. We definitely agree on the desired outcome, but not on the means to get there.

Basically, I think it boils down to this: you want a technical solution where different actors share responsibility and possible culpability, whereas I want a legal solution that doesn't rely on a technical foundation, but rather on a legal framework that outlaws scraping content unless it is explicitly stated to be available for AI training.
That brings me back to application versus protocol again. You can have an application where the content is the payment, and where you as a user allow AI scraping as a way to pay for the service. It will have to be clearly stated that this is the agreement between you and the service provider, as it is today with e.g. Facebook and others.
Think about it: you can either start playing whack-a-mole with flags added to protocols, or you can simply outlaw scraping unless explicitly allowed - protocols be damned.
Either way you will need a legal framework to handle this, and a protocol flag without legal backing is at best worthless ("Do Not Track"), at worst a false sense of security.
I think the fight to get a general ban and the fight to get a per-protocol ban will be just about the same.