What are Fedi Admins doing to block Meta scrapers?
-
cross-posted from: https://infosec.exchange/users/thenexusofprivacy/statuses/115012347040350824
As you’ve probably seen or heard, Dropsitenews has published a list (from a Meta whistleblower) of “the roughly 100,000 top websites and content delivery network addresses scraped to train Meta’s proprietary AI models” – including quite a few fedi sites. Meta denies everything, of course, but they routinely lie through their teeth, so who knows. In any case, whether or not the specific details in the report are accurate, it’s certainly a threat worth thinking about.
So I’m wondering what defenses fedi admins are using today to try to defeat scrapers: robots.txt, user-agent blocking, firewall-level blocking of IP ranges, Cloudflare or Fastly AI scraper blocking, Anubis, stuff you don’t want to disclose …
And a couple more open-ended questions:
-
Do you feel like your defenses against scraping are generally holding up pretty well?
-
Are there other approaches that you think might be promising that you just haven’t had the time or resources to try?
-
Do you have any language in your terms of service that attempts to prohibit training for AI?
Here’s @FediPact’s post with a link to the Dropsitenews report and (in the replies) a list of fedi instances and CDNs that show up on the list.
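For the firewall-level IP-range blocking mentioned above, the matching logic can be sketched with Python’s standard ipaddress module. This is a minimal sketch, not anyone’s actual setup, and the ranges below are placeholders: real blocklists would come from published crawler IP lists or your own server logs.

```python
import ipaddress

# Placeholder ranges for illustration only -- NOT Meta's actual ranges.
# Populate from published crawler IP lists or your own access logs.
BLOCKED_RANGES = [
    ipaddress.ip_network("57.141.0.0/24"),
    ipaddress.ip_network("2a03:2880::/32"),
]

def is_blocked(ip: str) -> bool:
    """Return True if the client IP falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    # Mixed-version comparisons (v4 address vs. v6 network) are simply False.
    return any(addr in net for net in BLOCKED_RANGES)
```

In practice the same ranges would be fed to nftables/iptables or a CDN rule rather than checked in application code, but the containment test is the same idea.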
-
The only thing I’ve been doing on my blog (not on my piefed instance yet, but probably should) was user agent filtering:
if ($http_user_agent ~* (SemrushBot|AhrefsBot|PetalBot|YisouSpider|Amazonbot|VelenPublicWebCrawler|DataForSeoBot|Expanse,\ a\ Palo\ Alto\ Networks\ company|BacklinksExtendedBot|ClaudeBot|OAI-SearchBot)) {
    return 403;
}
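The nginx rule above can be sanity-checked offline before deploying. This is a rough Python mirror of the same alternation (nginx’s ~* operator is a case-insensitive PCRE match, approximated here with re.IGNORECASE), handy for testing sample user-agent strings:

```python
import re

# Same alternation as the nginx "if" block; the backslash-escaped spaces in
# the nginx config are plain spaces in a Python string.
BOT_PATTERN = re.compile(
    r"SemrushBot|AhrefsBot|PetalBot|YisouSpider|Amazonbot|"
    r"VelenPublicWebCrawler|DataForSeoBot|Expanse, a Palo Alto Networks company|"
    r"BacklinksExtendedBot|ClaudeBot|OAI-SearchBot",
    re.IGNORECASE,
)

def is_bot(user_agent: str) -> bool:
    """Mirror of the nginx rule: True means the request would get a 403."""
    return BOT_PATTERN.search(user_agent) is not None
```

Note this only catches crawlers that identify themselves honestly; anything spoofing a browser user agent sails through.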
-
Just to clarify your question: are you concerned about Meta’s scrapers causing additional server load, or about them stealing the content?
-
Not OP, but I’d be concerned about both.
-
The nature of federation makes the latter basically impossible to prevent. All data is federated freely, so all Meta has to do is spin up an instance and the data is handed directly to them.
-
There are no PieFed instances in that list. Maybe that’s because Meta is blocked in the default PieFed robots.txt, or maybe PieFed is just too obscure.
The robots.txt on Mastodon and Lemmy is so minimal it borders on negligent.
The Mbin robots.txt is massive but does not block Meta’s crawler, so presumably it is not being kept up to date.
Any fedi devs reading this: add these:
User-agent: meta-externalagent
User-agent: Meta-ExternalAgent
User-agent: meta-externalfetcher
User-agent: Meta-ExternalFetcher
User-agent: TikTokSpider
User-agent: DuckAssistBot
User-agent: anthropic-ai
Disallow: /
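To double-check that a grouped stanza like that parses the way you expect, Python’s urllib.robotparser can evaluate it locally. A quick sketch (the /u/someuser path is just an arbitrary example):

```python
import urllib.robotparser

# The stanza suggested above: several User-agent lines sharing one Disallow.
ROBOTS_TXT = """\
User-agent: meta-externalagent
User-agent: Meta-ExternalAgent
User-agent: meta-externalfetcher
User-agent: Meta-ExternalFetcher
User-agent: TikTokSpider
User-agent: DuckAssistBot
User-agent: anthropic-ai
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Listed crawlers are denied everywhere; agents not listed stay allowed.
print(rp.can_fetch("meta-externalagent", "/u/someuser"))  # False
print(rp.can_fetch("Googlebot", "/u/someuser"))           # True
```

Of course, robots.txt is purely advisory: it only restrains crawlers that choose to honor it.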
-