My timeline (which contains a lot of project leaders/sysadmins from big projects) is filling with posts about a new, ongoing wave of what are most likely scrapers collecting training data for „AI“ companies.

jab01701mid@mastodon.social

@jwildeboer @alan It's sad that we can't use MAC address-based filtering on the IoT client devices themselves. All of them reserve blocks of MAC addresses, usually from the NIC manufacturer's block, so it would be easy to block all traffic from "Samsung TV sets" or "Roku Devices" or "Apple TV".

rytmis@hachyderm.io

@agturcz @alan @jwildeboer

Apparently many "smart" TV manufacturers ship proxy SDKs from companies like Bright, and they turn the TVs into nodes in a botnet that is used for "AI" data scraping, so the traffic comes from all over the place.

I’d guess not many consumers know about it, let alone have the technical know-how to prevent it.

woozle@toot.cat

@jwildeboer @alan

I have a proposal: BotID

chrisp@cyberplace.social

@dzwiedziu @jwildeboer Web 4.0, invite only with lobste.rs style reputation system.

photo55@mastodon.social

@jwildeboer @alan
I suppose one could rate limit the site's outward traffic, without reference to where it is going ...
And then have a very long list of specific addresses which get an extra rate.
And that list, which would be like a Squid access list, could be shared in some sections among quite a lot of sites.
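
A minimal sketch of that idea, assuming a token bucket is an acceptable way to rate limit: one global bucket caps all outbound traffic regardless of destination, and clients on a shared list get an extra bucket that is tried first. The class names, the rates, and the example network (a documentation range) are all hypothetical, and the time is passed in explicitly so the behaviour is easy to follow.

```python
import ipaddress


class TokenBucket:
    """Token bucket: refills `rate` tokens per second, holds at most `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now):
        # Refill for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class SiteLimiter:
    """Global limit for everyone, plus an extra allowance for listed networks."""

    def __init__(self, global_rate, global_burst,
                 extra_networks, extra_rate, extra_burst):
        self.global_bucket = TokenBucket(global_rate, global_burst)
        # One extra bucket per listed network (the shareable "access list").
        self.extra = [
            (ipaddress.ip_network(net), TokenBucket(extra_rate, extra_burst))
            for net in extra_networks
        ]

    def allow(self, client_ip, now):
        addr = ipaddress.ip_address(client_ip)
        for net, bucket in self.extra:
            if addr in net:
                if bucket.allow(now):
                    return True
                break  # extra allowance used up, fall back to the global limit
        return self.global_bucket.allow(now)


# Hypothetical numbers: 1 request/s globally, plus 1 request/s for the listed range.
limiter = SiteLimiter(1, 2, ["192.0.2.0/24"], 1, 3)
```

With these numbers, an unlisted client shares the two-token global burst with everyone else, while a client in the listed range can spend three extra tokens before it competes for the global ones.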

mgd81@infosec.exchange

@jwildeboer At work, I've simply begun blocking /8s at the firewall... it's easier and actually causes less collateral damage than one might assume at this point.
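
For a sense of what blocking a /8 means, here is a small sketch of the matching logic using Python's standard `ipaddress` module. A real deployment would do this in the firewall itself. The two networks below are placeholders (private and loopback ranges), not actual scraper sources, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical block list: placeholder /8s, not real scraper sources.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("127.0.0.0/8"),
]


def is_blocked(client_ip):
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


def addresses_covered():
    """Total number of addresses covered by the block list."""
    return sum(net.num_addresses for net in BLOCKED_NETWORKS)
```

Each IPv4 /8 covers 2**24 = 16,777,216 addresses, so every entry on such a list removes a large slice of the address space in one rule.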

dascandy@infosec.exchange

@alan @jwildeboer The residential proxy is the *industrialized*-scale use of smart TVs to host proxies that companies route their requests through, so it's actually most people's regular TVs that are attacking you.

Which also means that if you block any of them, you're cutting off actual users too. Residences have one IP, and most users don't even know they're hosting a proxy for companies to lease.

dascandy@infosec.exchange

@DamonHD @nxadm @koen_hufkens @jwildeboer Not so much compromise as shipped with the device.

harry_wood@en.osm.town

@jwildeboer @TimWardCam Yes. This was puzzling me. Surely the big AI providers (OpenAI, Google, etc.) wouldn't want to damage their brand by operating scrapers so incompetently.

But no. It's not them. The scraperpocalypse coincides with the arrival of LLMs *partly* because of increased demand for data sets, but partly just because LLM coding enables vast armies of script kiddies to easily develop scrapers that use circumvention tactics.

jwildeboer@social.wildeboer.net

@harry_wood The scrapers that hammer my server aren't using circumvention tactics and are, in fact, very stupid ones. What makes them hard to block is that they come from all over the world, with unique IP addresses that have no clearly identifiable origin. That's the "residential IP proxy" effect. They do a few requests and disappear again. So rate limiting doesn't catch them. @TimWardCam
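
To illustrate why per-IP rate limiting misses this pattern, here is a toy sketch with synthetic data. The function name, the limit of 10 requests per IP, and the request counts are assumptions chosen for illustration, not measurements from any real server.

```python
from collections import Counter


def blocked_requests(request_ips, per_ip_limit):
    """Count the requests a simple per-IP limit would reject."""
    seen = Counter()
    blocked = 0
    for ip in request_ips:
        seen[ip] += 1
        if seen[ip] > per_ip_limit:
            blocked += 1
    return blocked


# 10,000 requests from a single address: almost all are rejected.
single_source = ["client-0"] * 10_000

# 10,000 requests spread over 5,000 addresses, two requests each:
# no address ever reaches the limit.
distributed = [f"client-{i}" for i in range(5_000) for _ in range(2)]

print(blocked_requests(single_source, per_ip_limit=10))
print(blocked_requests(distributed, per_ip_limit=10))
```

The total load is the same in both cases, but the limiter only ever sees two requests per address in the distributed case, so it has nothing to act on.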

b_b@mastodon.roflcopter.fr

@algernon @grumpybozo @alan @jwildeboer Did you try that trick? What tool did you use, and did it work well?

algernon@come-from.mad-scientist.club

@b_b @grumpybozo @alan @jwildeboer I've been using this trick (+ a few tweaks) for about a year now, with iocaine, with great success.

jwildeboer@social.wildeboer.net

@algernon A wonderful understatement. Perfect answer @b_b @grumpybozo @alan

profpatsch@mastodon.xyz

@rytmis @agturcz @alan @jwildeboer omg this is a monetization angle for TVs that is just so obvious when you consider the race to the bottom in that industry

rytmis@hachyderm.io

@Profpatsch @agturcz @alan @jwildeboer

Yep. I just read about it some weeks back and immediately tried to look for dumb TVs as an alternative. Of course, they don’t really exist as a product category any more, so the next best thing was to block those things at the router.

nicd@masto.ahlcode.fi

@jwildeboer Just this week a repository in my Forgejo instance was under attack. In a day, I racked up over 130k distinct IPs with fail2ban and had to abandon that approach.

I now have a simple trick that cut out practically all of the traffic, but I hesitate to share it as it's not difficult to work around… I wish we didn't have to resort to such things.