My timeline (which contains a lot of project leaders/sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies.

damonhd@mastodon.social

@nxadm @koen_hufkens @jwildeboer One set of those 'residential proxies' is apparently compromised 'smart' TVs; another is stuff silently embedded in 'free' mobile games. We all pay the price for those.

nxadm@infosec.exchange

@DamonHD @koen_hufkens @jwildeboer

That's alarming.

jamoteusz@mastodon.com.pl

@jwildeboer “the great digital theft” known as “knowledge harvesting”

agturcz@circumstances.run

@alan I'm sorry for asking that, but do you actually know what a "residential proxy" is?

@jwildeboer

leonardodiottio@mastodon.social

@alan @jwildeboer As these IPs are largely from people’s personal connections (for instance because they have a malware-infected smart TV or router, or run some kind of smart TV app, free game, or browser extension with this dubious code deliberately inserted in it), you would effectively be blocking entire consumer ISPs.

If you run a regular website you want people using regular consumer ISPs to reach it.

That makes the use of these proxies so effective.

jwildeboer@social.wildeboer.net

@alan These botnets are more or less immune to rate limiting, as they use many (and I mean millions of) IP addresses for a run, and each IP address is only used for a few requests before it is put back in the queue. The IP addresses are also from many different providers, so a (sub-)net-wide block doesn't help either. I wrote about those "residential IP proxies" in [1] and [2].

[1] https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
[2] https://jan.wildeboer.net/2025/04/Web-is-Broken-Botnet-Part-2/
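To illustrate why per-IP rate limiting does not catch this pattern, here is a minimal Python sketch. It is not taken from the linked posts: the `PerIPRateLimiter` class, the `simulate` helper, the synthetic `ip-N` labels, and the limit of 10 requests per 60 seconds are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class PerIPRateLimiter:
    """Sliding-window limiter: at most `limit` requests per IP per `window` seconds.

    Hypothetical helper for illustration, not any specific product's behaviour.
    """

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


def simulate(num_requests, pool_size, limit=10, window=60.0):
    """Send 100 requests per second, rotating round-robin through `pool_size`
    synthetic IP labels. Returns how many requests the limiter blocked."""
    limiter = PerIPRateLimiter(limit, window)
    blocked = 0
    for i in range(num_requests):
        ip = f"ip-{i % pool_size}"  # placeholder label, not a real address
        now = i / 100
        if not limiter.allow(ip, now):
            blocked += 1
    return blocked


# Same traffic volume, two very different outcomes:
single_source = simulate(6000, pool_size=1)      # one noisy client
rotating_pool = simulate(6000, pool_size=2000)   # 3 requests per IP
```

Under these assumed numbers the single client is blocked almost immediately, while the rotating pool never exceeds the per-IP limit, even though the server sees the same total load in both runs.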

kruthoff@mastodon.social

@jwildeboer I can confirm. Even on my small blog I have been seeing over 1200 different IP addresses scraping for a while now.

alan@lighthouse.co.im

@jwildeboer 1/2
Solidarity -- you're not alone in this. The one-attempt-per-IP pattern is specifically designed to be invisible to anything threshold-based. CrowdSec helps at the edges but a fresh residential IP making a single SASL attempt looks like a legitimate user having a bad day. Your manual cronjob approach is the right call. Automation just gives you false confidence.

alan@lighthouse.co.im

@jwildeboer 2/2
The deeper problem is upstream. Apple, Google and Microsoft are allowing SDK-injected bandwidth harvesting through their app stores. Until that's addressed at source, we're all playing whack-a-mole with an essentially infinite residential IP pool. This isn't a mail security problem -- it's a platform accountability problem.

galooph@masto.galooph.com

@jwildeboer We're definitely seeing this at @codeenigma

hannorein@mastodon.social

@jwildeboer same here. The documentation of REBOUND https://rebound.hanno-rein.de is getting hammered with scrapers. It was never a problem before because the website is small and only contains static pages. Ridiculous.

dzwiedziu@mastodon.social

@jwildeboer
This is going to end up with an invitation-only graynet, where you'll be banned the moment you try trawling.

timwardcam@c.im

@jwildeboer Why would an AI company need millions of copies of the same data?

rupert@mastodon.nz

@TimWardCam @jwildeboer Because their trawler is as sloppily coded as everything else they do.

maswan@mastodon.acc.sunet.se

@jwildeboer Yup. Our free software mirror sometimes takes a minute to respond to an apt update, because there are millions of scraper IPs hitting the entire namespace (we use a fast cache of popular files for performance), saturating the backend storage.

We've tried various blocking and qos approaches, but we have yet to find something that really helps.

Baseline performance for us is 10-40Gbit/s, and we are now down to hardware upgrades in the hopes that real users will have enough left over.

jwildeboer@social.wildeboer.net

@TimWardCam It's not necessarily the AI companies themselves. There's a whole new sector of (VC-backed) startups that claim to be able to deliver perfectly clean and curated training data for domain-specific models. And in a weird turn of events, they find out that many crawlers running in big datacenters are now being blocked by many sites they want to scrape. So using the "residential proxy IP" botnets seems to them a good option.

jwildeboer@social.wildeboer.net

@maswan Ouch

froztbyte@mastodon.social

@LeonardoDiOttio @alan @jwildeboer it's not only that kind of stuff, fwiw - go look up bright data's sdk (for example), then do some speculative math on how many people are out there with phones that are full of apps with that sort of shit in it (there's more than only bright-sdk out there)

grumpybozo@toad.social

@alan @jwildeboer The attacks have thwarted all of those tactics. They use UAs constructed from real UA tokens with minor variations. They have graduated from cheap VMs on Huawei Cloud and Digital Ocean to random IoTs in millions of households and mobile devices in millions of hands.
A few days ago I was able to measure over a thousand simultaneous sessions, each from a different /16 network.
My response to that isn’t taking the site down, but I am shedding load aggressively.
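A measurement like the one above can be approximated by grouping client addresses by their enclosing /16 network. The following Python sketch is only an assumption about how one might do that, not the poster's method; `count_by_slash16` is a hypothetical helper, the sample addresses come from documentation ranges, and it handles IPv4 only.

```python
import ipaddress
from collections import Counter


def count_by_slash16(ips):
    """Count how many client addresses fall into each IPv4 /16 network."""
    counts = Counter()
    for ip in ips:
        # strict=False masks off the host bits, e.g. 203.0.113.7 -> 203.0.0.0/16
        net = ipaddress.ip_network(f"{ip}/16", strict=False)
        counts[str(net)] += 1
    return counts


# Sample client addresses (documentation ranges, purely illustrative).
sample = ["203.0.113.7", "203.0.5.9", "198.51.100.1", "192.0.2.44"]
per_net = count_by_slash16(sample)
distinct_networks = len(per_net)
```

If the number of distinct /16 networks is close to the number of sessions, blocking by subnet removes almost nothing per rule, which matches what the thread describes.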

avuko@infosec.exchange

@alan @jwildeboer

It is all an accountability problem.