FARVEL BIG TECH

What are Fedi Admins doing to block Meta scrapers?

    Guest wrote:
    #1

    cross-posted from: https://infosec.exchange/users/thenexusofprivacy/statuses/115012347040350824

    As you’ve probably seen or heard, Dropsitenews has published a list (from a Meta whistleblower) of “the roughly 100,000 top websites and content delivery network addresses scraped to train Meta’s proprietary AI models” – including quite a few fedi sites. Meta denies everything, of course, but they routinely lie through their teeth, so who knows. In any case, whether or not the specific details in the report are accurate, it’s certainly a threat worth thinking about.

    So I’m wondering what defenses fedi admins are using today to try to defeat scrapers: robots.txt, user-agent blocking, firewall-level blocking of IP ranges, Cloudflare or Fastly AI scraper blocking, Anubis, stuff you don’t want to disclose …

    And a couple of more open-ended questions:

    • Do you feel like your defenses against scraping are generally holding up pretty well?

    • Are there other approaches that you think might be promising that you just haven’t had the time or resources to try?

    • Do you have any language in your terms of service that attempts to prohibit training for AI?

    Here’s @FediPact’s post with a link to the Dropsitenews report and (in the replies) a list of fedi instances and CDNs that show up on the list.

    https://cyberpunk.lol/@FediPact/114999480874284493

    @fediverse @fediversenews

    #MastoAdmin #Meta #FediPact

      jeena@piefed.jeena.net wrote:
      #2

      The only thing I’ve done so far on my blog (not on my PieFed instance yet, though I probably should) is user-agent filtering:

      # Return 403 to any request whose User-Agent matches one of these crawlers (~* is case-insensitive).
      if ($http_user_agent ~* (SemrushBot|AhrefsBot|PetalBot|YisouSpider|Amazonbot|VelenPublicWebCrawler|DataForSeoBot|Expanse,\ a\ Palo\ Alto\ Networks\ company|BacklinksExtendedBot|ClaudeBot|OAI-SearchBot)) {
          return 403;
      }
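
To try a list like this against sample User-Agent strings before deploying it, the same check can be mirrored offline. This is only a minimal sketch: `status_for` is a hypothetical helper, the sample User-Agent strings are illustrative rather than taken from real logs, and Python's `re` engine stands in for the regex engine nginx uses.

```python
import re

# Same alternation as the nginx rule above. nginx's `~*` operator is a
# case-insensitive regex match, mirrored here with re.IGNORECASE.
BLOCKED_AGENTS = re.compile(
    r"SemrushBot|AhrefsBot|PetalBot|YisouSpider|Amazonbot|VelenPublicWebCrawler"
    r"|DataForSeoBot|Expanse, a Palo Alto Networks company"
    r"|BacklinksExtendedBot|ClaudeBot|OAI-SearchBot",
    re.IGNORECASE,
)


def status_for(user_agent: str) -> int:
    """Return 403 for a listed crawler, 200 otherwise (hypothetical helper)."""
    # re.search matches anywhere in the header, like an unanchored nginx regex.
    return 403 if BLOCKED_AGENTS.search(user_agent) else 200


if __name__ == "__main__":
    # Illustrative User-Agent values, not captured from real traffic.
    for ua in (
        "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
        "meta-externalagent/1.1",
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0",
    ):
        print(status_for(ua), ua)
```

Note that this particular list has no entry for Meta's crawler names, so a request identifying itself as `meta-externalagent` would pass through unless that name is added. User-agent filtering also only stops crawlers that identify themselves honestly.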

        camerondev@programming.dev wrote:
        #3

        Just to clarify your question: are you concerned about Meta’s scrapers causing additional server load, or about them stealing the content?

          the_abecedarian@piefed.social wrote:
          #4

          Not OP, but I’d be concerned about both.

            camerondev@programming.dev wrote:
            #5

            The nature of federation makes the latter basically impossible to prevent. All data is federated freely, so all Meta has to do is spin up an instance and the data is handed directly to them.

              rimu@piefed.social wrote (last edited by rimu@piefed.social):
              #6

              There are no PieFed instances in that list. Maybe because Meta is blocked in the default PieFed robots.txt or maybe PieFed is too obscure.

              The robots.txt on Mastodon and Lemmy is so minimal it borders on negligent.

              The Mbin robots.txt is massive but does not block Meta’s crawler so presumably it is not being kept up to date.

              Any fedi devs reading this: add these

              User-agent: meta-externalagent  
              User-agent: Meta-ExternalAgent  
              User-agent: meta-externalfetcher  
              User-agent: Meta-ExternalFetcher  
              User-agent: TikTokSpider  
              User-agent: DuckAssistBot  
              User-agent: anthropic-ai  
              Disallow: /  
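
One way to sanity-check a rule group like this is to feed it to Python's standard-library `urllib.robotparser` and ask whether a given crawler name may fetch a page. This is a rough sketch, not how any fedi platform tests its robots.txt: `is_allowed` is a hypothetical helper and `https://example.social/` is a placeholder URL. It only shows what a well-behaved parser concludes; robots.txt is advisory, so a crawler that ignores it is not stopped.

```python
from urllib.robotparser import RobotFileParser

# The rule group from the post above: several User-agent lines sharing
# one Disallow rule.
ROBOTS_TXT = """\
User-agent: meta-externalagent
User-agent: Meta-ExternalAgent
User-agent: meta-externalfetcher
User-agent: Meta-ExternalFetcher
User-agent: TikTokSpider
User-agent: DuckAssistBot
User-agent: anthropic-ai
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.social/") -> bool:
    """Report whether these rules let `user_agent` fetch `url` (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("meta-externalagent/1.1", "TikTokSpider", "Mozilla/5.0"):
        print(agent, "allowed" if is_allowed(agent) else "blocked")
```

This parser compares agent names case-insensitively, so for it the lower-case and mixed-case lines are equivalent; listing both does no harm and may help with stricter parsers. Agents that are not named in the group stay allowed because there is no `User-agent: *` entry.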
              