FARVEL BIG TECH

My timeline (which contains a lot of project leaders/sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies.

  • jwildeboer@social.wildeboer.net wrote:

    My timeline (which contains a lot of project leaders/sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. They seem to be using botnets (or what some call „residential IP proxies“ to make it sound a bit more legitimate) with millions of IP addresses, making it really hard to defend against. Some have decided to take their sites down until this is over. This is now the world we live in 😞

    mgd81@infosec.exchange wrote (#35):

    @jwildeboer At work, I've simply begun blocking /8's at the firewall.... it's easier and actually causes less collateral damage than one might assume at this point.
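
For readers unfamiliar with what blocking a /8 means in practice, here is a minimal sketch using Python's standard `ipaddress` module. The `BLOCKED_NETWORKS` list and the `is_blocked` helper are hypothetical names for illustration, and the ranges are private and documentation ranges standing in for whatever a real firewall would block. It is not the poster's actual firewall configuration.

```python
import ipaddress

# Hypothetical blocklist. These are private/documentation ranges used as
# stand-ins; a real deployment would list the networks it actually blocks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(addr: str) -> bool:
    """Return True if addr falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_NETWORKS)


def addresses_covered(cidr: str) -> int:
    """Number of addresses a single CIDR block covers."""
    return ipaddress.ip_network(cidr).num_addresses
```

A single /8 covers 2**24 (about 16.8 million) addresses, which is why one rule can replace a very long list of individual bans, and also why the collateral damage question comes up at all.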

    • alan@lighthouse.co.im wrote:

      @jwildeboer It's weird that the response is to take sites down rather than reach for technical countermeasures -- rate limiting, UA filtering, datacenter ASN blocks. Is the residential proxy problem genuinely that hard to solve at scale, or is the downtime itself the point? A visible protest signal rather than a quiet WAF tweak feels like a different kind of statement about where people think the leverage actually is?

      dascandy@infosec.exchange wrote (#36):

      @alan @jwildeboer Residential proxies are the *industrialized*-scale use of smart TVs as proxies that companies route their requests through, so it's actually most people's regular TVs that are attacking you.

      Which also means that if you block any of them, you're cutting off actual users too. Residences have one IP, and most users don't even know they're hosting a proxy for companies to lease.

      • damonhd@mastodon.social wrote:

        @nxadm @koen_hufkens @jwildeboer One set of those 'residential proxies' is apparently compromised 'smart' TVs; another is stuff silently embedded in 'free' mobile games. We all pay the price for those.

        dascandy@infosec.exchange wrote (#37):

        @DamonHD @nxadm @koen_hufkens @jwildeboer Not so much compromised as shipped with the device that way.

        • jwildeboer@social.wildeboer.net wrote:

          @TimWardCam It's not necessarily the AI companies themselves. There's a whole new sector of (VC-backed) startups that claim to be able to deliver perfectly clean and curated training data for domain-specific models. And in a weird turn of events, they find out that many crawlers running in big datacenters are now being blocked by many sites they want to scrape. So using the "residential proxy IP" botnets seems to them a good option.

          harry_wood@en.osm.town wrote (#38):

          @jwildeboer @TimWardCam Yes. This was puzzling me. Surely the big AI providers, OpenAI, Google, etc, wouldn't want to damage their brand by operating scrapers so incompetently.

          But no. It's not them. The scraperpocalypse coincides with the arrival of LLMs *partly* because of increased demand for data sets, but partly just because LLM coding enables vast armies of script kiddies to easily develop scrapers that use circumvention tactics.

            jwildeboer@social.wildeboer.net wrote (#39):

            @harry_wood The scrapers that hammer my server aren't using circumvention tactics and are, in fact, very stupid ones. What makes them hard to block is that they come from all over the world, from unique IP addresses with no clearly identifiable origin. That's the "residential IP proxy" effect. Each address makes a few requests and disappears again, so per-IP rate limiting doesn't catch them. @TimWardCam
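
To make the rate limiting point concrete, here is a small sketch of a per-IP sliding-window limiter. The class name, the limit of 10 requests per 60 seconds, and the client labels are all assumptions chosen for illustration, not anything taken from the poster's setup. It compares one client sending 3000 requests with 1000 clients sending 3 requests each.

```python
from collections import defaultdict, deque


class PerIPRateLimiter:
    """Sliding-window limiter keyed on the client address (illustrative)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client -> timestamps of allowed hits

    def allow(self, client: str, now: float) -> bool:
        hits = self._hits[client]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def count_allowed(requests, max_requests=10, window_seconds=60.0) -> int:
    """Count how many (client, timestamp) requests get through."""
    limiter = PerIPRateLimiter(max_requests, window_seconds)
    return sum(limiter.allow(client, t) for client, t in requests)


# Same total volume, 3000 requests in 30 seconds, in two shapes.
single_source = [("client-0", i * 0.01) for i in range(3000)]
spread_out = [(f"client-{i % 1000}", i * 0.01) for i in range(3000)]
```

With these assumed numbers, `count_allowed(single_source)` lets 10 requests through, while `count_allowed(spread_out)` lets all 3000 through, because no individual address ever reaches the limit.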

            • algernon@come-from.mad-scientist.club wrote:

              @grumpybozo @alan @jwildeboer FWIW, you can still mitigate most of them if you look at headers other than the user agent.

              Many of the crawlers that try to disguise themselves as real browsers utterly fail at sending headers those browsers would, like sec-fetch-mode on any HTTPS request.

              With few exceptions, if the UA contains Chrome/ or Firefox/ and the request doesn't have a sec-fetch-mode header, it is almost certainly a crawler.

              I've been successfully mitigating pretty much all of them for about a year now (from ~100 million requests/day down to 3 million, the majority of which is served garbage).
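
The header check described above can be sketched in a few lines. This is only an illustration of the stated rule (a UA containing `Chrome/` or `Firefox/` but no `sec-fetch-mode` header), not the poster's actual configuration or any particular tool's rule syntax. The function name is hypothetical, and the "few exceptions" and "few tweaks" the post mentions are not modeled.

```python
def looks_like_disguised_crawler(headers: dict[str, str]) -> bool:
    """Flag requests that claim to be Chrome or Firefox in the User-Agent
    but do not send a Sec-Fetch-Mode header."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    claims_browser = "Chrome/" in user_agent or "Firefox/" in user_agent
    return claims_browser and "sec-fetch-mode" not in normalized
```

For example, a request with `User-Agent: Mozilla/5.0 ... Chrome/120.0` and no `Sec-Fetch-Mode` header would be flagged, while the same UA with `Sec-Fetch-Mode: navigate` would not. What to do with a flagged request (block it, slow it down, or serve it garbage as the post describes) is a separate decision.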

              b_b@mastodon.roflcopter.fr wrote (#40):

              @algernon @grumpybozo @alan @jwildeboer Did you try that trick yourself? What tool did you use, and did it work well?

                algernon@come-from.mad-scientist.club wrote (#41):

                @b_b @grumpybozo @alan @jwildeboer I've been using this trick (+ a few tweaks) for about a year now, with iocaine, with great success.

                  jwildeboer@social.wildeboer.net wrote (#42):

                  @algernon A wonderful understatement. Perfect answer 🙂 @b_b @grumpybozo @alan

                  • rytmis@hachyderm.io wrote:

                    @agturcz @alan @jwildeboer

                    Apparently many ”smart” TV manufacturers ship proxy SDKs from companies like Bright, and they turn the TVs into nodes in a botnet that is used for ”AI” data scraping, so the traffic comes from all over the place.

                    I’d guess not many consumers know about it, let alone have the technical know-how to prevent it.

                    profpatsch@mastodon.xyz wrote (#43):

                    @rytmis @agturcz @alan @jwildeboer omg this is a monetization angle for TVs that is just so obvious when you consider the race to the bottom in that industry

                      rytmis@hachyderm.io wrote (#44):

                      @Profpatsch @agturcz @alan @jwildeboer

                      Yep. I just read about it some weeks back and immediately tried to look for dumb TVs as an alternative. Of course, they don’t really exist as a product category any more, so the next best thing was to block those things at the router. ☹️

                        nicd@masto.ahlcode.fi wrote (#45):

                        @jwildeboer Just this week a repository in my Forgejo instance was under attack. In a day, I racked up over 130k distinct IPs with fail2ban and had to abandon that approach.

                        I now have a simple trick that cut out practically all of the traffic, but I hesitate to share it as it's not difficult to work around… I wish we didn't have to resort to such things.
