FARVEL BIG TECH

My timeline (which contains a lot of project leaders/sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies.

    jwildeboer@social.wildeboer.net wrote:

    My timeline (which contains a lot of project leaders/sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. They seem to be using botnets (or what some call „residential IP proxies“ to make it sound a bit more legitimate) with millions of IP addresses, making it really hard to defend against. Some have decided to take their sites down until this is over. This is now the world we live in 😞

    koen_hufkens@mastodon.social wrote (#4):

    @jwildeboer I think this should be illegal in the EU if found out. Not entirely sure.

      nxadm@infosec.exchange wrote (#5):

      @koen_hufkens @jwildeboer

      De facto botnets sound illegal to me. But in today's EU only if the company is Russian or Chinese. US companies do whatever they like.

        damonhd@mastodon.social wrote (#6):

        @nxadm @koen_hufkens @jwildeboer One set of those 'residential proxies' is apparently compromised 'smart' TVs; another is stuff silently embedded in 'free' mobile games. We all pay the price for those.

          nxadm@infosec.exchange wrote (#7):

          @DamonHD @koen_hufkens @jwildeboer

          That's alarming.

            jamoteusz@mastodon.com.pl wrote (#8):

            @jwildeboer “the great digital theft” known as “knowledge harvesting”

              alan@lighthouse.co.im wrote:

              @jwildeboer It's weird that the response is to take sites down rather than reach for technical countermeasures -- rate limiting, UA filtering, datacenter ASN blocks. Is the residential proxy problem genuinely that hard to solve at scale, or is the downtime itself the point? A visible protest signal rather than a quiet WAF tweak feels like a different kind of statement about where people think the leverage actually is?

              agturcz@circumstances.run wrote (#9):

              @alan I'm sorry for asking, but do you actually know what a "residential proxy" is?

              @jwildeboer

                leonardodiottio@mastodon.social wrote (#10):

                @alan @jwildeboer As these IPs are largely from people’s personal connections (for instance because they have a malware-infected smart TV or router, or run some kind of smart TV app/free game/browser extension with this dubious code deliberately inserted into it), you would effectively be blocking entire consumer ISPs.

                If you run a regular website you want people using regular consumer ISPs to reach it.

                That makes the use of these proxies so effective.
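
A rough sketch of the collateral damage described above, assuming you respond to each offending address by blocking its enclosing network. The three addresses are hypothetical (taken from the RFC 5737 documentation ranges, not from any real log), and `collateral` is a made-up helper name.

```python
import ipaddress

# Hypothetical scraper addresses from documentation ranges, not real data.
scraper_ips = ["192.0.2.17", "198.51.100.203", "203.0.113.5"]


def collateral(ips, prefix_len):
    """Return (networks blocked, addresses covered) when every offending IP
    is blocked via its enclosing network of the given prefix length."""
    networks = {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) for ip in ips
    }
    return len(networks), sum(net.num_addresses for net in networks)


for prefix_len in (24, 16):
    nets, addresses = collateral(scraper_ips, prefix_len)
    print(f"/{prefix_len}: {nets} networks blocked, {addresses} addresses covered")
```

Each additional offender in a new network adds a whole block of ordinary subscribers to the blocklist, which is why subnet-wide blocks end up looking like ISP-wide blocks.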

                  jwildeboer@social.wildeboer.net wrote (#11):

                  @alan These botnets are more or less immune to rate limiting, as they use many (and I mean millions) of IP addresses for a run, and each IP address is only used for a few requests before it is put back in the queue. The IP addresses are also from many different providers, so a (sub-)net wide block also doesn't help. I wrote about those "residential IP proxies" in [1] and [2].

                  [1] https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
                  [2] https://jan.wildeboer.net/2025/04/Web-is-Broken-Botnet-Part-2/
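
A minimal sketch of why a per-IP threshold misses this pattern, assuming a simple sliding-window limiter. The class name, the limit of 10 requests per 60 seconds, the pool size, and the `ip-N` addresses are all invented for illustration and are not taken from the linked posts.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most `limit` requests per IP within `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def simulate(pool_size: int, total_requests: int) -> int:
    """Send one request every 10 ms, rotating round-robin over `pool_size`
    hypothetical IPs; return how many requests the limiter blocked."""
    limiter = PerIpRateLimiter(limit=10, window=60.0)
    blocked = 0
    for i in range(total_requests):
        ip = f"ip-{i % pool_size}"
        if not limiter.allow(ip, now=i * 0.01):
            blocked += 1
    return blocked


single = simulate(pool_size=1, total_requests=30_000)
rotating = simulate(pool_size=10_000, total_requests=30_000)
print(f"one IP: {single} blocked, rotating pool: {rotating} blocked")
```

The same request volume is almost entirely blocked when it comes from one address and passes untouched when it is spread over the pool, because no single address ever gets near the threshold.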

                    kruthoff@mastodon.social wrote (#12):

                    @jwildeboer I can confirm. Even on my small blog I have been seeing over 1200 different IP addresses scraping for a while now.

                      alan@lighthouse.co.im wrote (#13):

                      @jwildeboer 1/2
                      Solidarity -- you're not alone in this. The one-attempt-per-IP pattern is specifically designed to be invisible to anything threshold-based. CrowdSec helps at the edges but a fresh residential IP making a single SASL attempt looks like a legitimate user having a bad day. Your manual cronjob approach is the right call. Automation just gives you false confidence.

                        alan@lighthouse.co.im wrote (#14):

                        @jwildeboer 2/2
                        The deeper problem is upstream. Apple, Google and Microsoft are allowing SDK-injected bandwidth harvesting through their app stores. Until that's addressed at source, we're all playing whack-a-mole with an essentially infinite residential IP pool. This isn't a mail security problem -- it's a platform accountability problem.

                          galooph@masto.galooph.com wrote (#15):

                          @jwildeboer We're definitely seeing this at @codeenigma 😞

                            hannorein@mastodon.social wrote (#16):

                            @jwildeboer same here. The documentation of REBOUND https://rebound.hanno-rein.de is getting hammered with scrapers. It was never a problem before because the website is small and only contains static pages. Ridiculous.

                              dzwiedziu@mastodon.social wrote (#17):

                              @jwildeboer
                              This is going to end up with an invitation-only graynet, where you'll be banned the moment you try trawling.

                                timwardcam@c.im wrote (#18):

                                @jwildeboer Why would an AI company need millions of copies of the same data?

                                  rupert@mastodon.nz wrote (#19):

                                  @TimWardCam @jwildeboer Because their trawler is as sloppily coded as everything else they do.

                                    maswan@mastodon.acc.sunet.se wrote (#20):

                                    @jwildeboer Yup. Our free software mirror sometimes takes a minute to respond to an apt update, because there are millions of scraper IPs hitting the entire namespace (we use a fast cache for popular files for performance) and saturating the backend storage.

                                    We've tried various blocking and QoS approaches, but we have yet to find something that really helps.

                                    Baseline performance for us is 10-40Gbit/s, and we are now down to hardware upgrades in the hopes that real users will have enough left over.

                                      jwildeboer@social.wildeboer.net wrote (#21):

                                      @TimWardCam It's not necessarily the AI companies themselves. There's a whole new sector of (VC-backed) startups that claim to be able to deliver perfectly clean and curated training data for domain-specific models. And in a weird turn of events, they find out that many crawlers running in big datacenters are now being blocked by many sites they want to scrape. So using the "residential proxy IP" botnets seems to them a good option.

                                        jwildeboer@social.wildeboer.net wrote (#22):

                                        @maswan Ouch 😞

                                          froztbyte@mastodon.social wrote (#23):

                                          @LeonardoDiOttio @alan @jwildeboer it's not only that kind of stuff, fwiw - go look up bright data's sdk (for example), then do some speculative math on how many people are out there with phones that are full of apps with that sort of shit in it (there's more than only bright-sdk out there)
