<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies.]]></title><description><![CDATA[<p>My timeline (which contains a lot of project leaders/sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. They seem to be using botnets (or what some call „residential IP proxies“ to make it sound a bit more legitimate) with millions of IP addresses, making it really hard to defend against. Some have decided to take their sites down until this is over. This is now the world we live in <img src="https://forum.fedi.dk/assets/plugins/nodebb-plugin-emoji/emoji/android/1f61e.png?v=7979fdcf9c7" class="not-responsive emoji emoji-android emoji--disappointed" style="height:23px;width:auto;vertical-align:middle" title=":(" alt="😞" /></p>]]></description><link>https://forum.fedi.dk/topic/20541b71-b1bf-4d91-9ca9-229bb09150c8/my-timeline-which-contains-a-lot-of-project-leaders-sysadmins-from-big-projects-is-filling-with-posts-about-a-new-ongoing-wave-of-what-most-likely-are-scrapers-collecting-training-data-for-ai-companies.</link><generator>RSS for Node</generator><lastBuildDate>Sun, 05 Jul 2026 07:22:36 GMT</lastBuildDate><atom:link href="https://forum.fedi.dk/topic/20541b71-b1bf-4d91-9ca9-229bb09150c8.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 29 Jun 2026 08:12:38 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting 
training data for „AI“ companies. on Fri, 03 Jul 2026 04:44:39 GMT]]></title><description><![CDATA[<p><span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> Just this week a repository in my Forgejo instance was under attack. In a day, I racked up over 130k distinct IPs with fail2ban and had to abandon that approach.</p><p>I now have a simple trick that cut out practically all of the traffic, but I hesitate to share it as it's not difficult to work around… I wish we didn't have to resort to such things.</p>]]></description><link>https://forum.fedi.dk/post/https://masto.ahlcode.fi/users/nicd/statuses/116854219038130655</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://masto.ahlcode.fi/users/nicd/statuses/116854219038130655</guid><dc:creator><![CDATA[nicd@masto.ahlcode.fi]]></dc:creator><pubDate>Fri, 03 Jul 2026 04:44:39 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Thu, 02 Jul 2026 13:56:25 GMT]]></title><description><![CDATA[<p><span><a href="/user/profpatsch%40mastodon.xyz">@<span>Profpatsch</span></a></span> <span><a href="/user/agturcz%40circumstances.run">@<span>agturcz</span></a></span> <span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> </p><p>Yep. I just read about it some weeks back and immediately tried to look for dumb TVs as an alternative. Of course, they don’t really exist as a product category any more, so the next best thing was to block those things at the router. 
<img src="https://forum.fedi.dk/assets/plugins/nodebb-plugin-emoji/emoji/android/2639.png?v=7979fdcf9c7" class="not-responsive emoji emoji-android emoji--white_frowning_face" style="height:23px;width:auto;vertical-align:middle" title="☹" alt="☹" />️</p>]]></description><link>https://forum.fedi.dk/post/https://hachyderm.io/users/rytmis/statuses/116850726384989544</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://hachyderm.io/users/rytmis/statuses/116850726384989544</guid><dc:creator><![CDATA[rytmis@hachyderm.io]]></dc:creator><pubDate>Thu, 02 Jul 2026 13:56:25 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Thu, 02 Jul 2026 13:48:06 GMT]]></title><description><![CDATA[<p><span><a href="/user/rytmis%40hachyderm.io">@<span>rytmis</span></a></span> <span><a href="/user/agturcz%40circumstances.run">@<span>agturcz</span></a></span> <span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> omg this is a monetization angle for TVs that is just so obvious when you consider the race to the bottom in that industry</p>]]></description><link>https://forum.fedi.dk/post/https://mastodon.xyz/users/Profpatsch/statuses/116850693700828996</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://mastodon.xyz/users/Profpatsch/statuses/116850693700828996</guid><dc:creator><![CDATA[profpatsch@mastodon.xyz]]></dc:creator><pubDate>Thu, 02 Jul 2026 13:48:06 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. 
on Wed, 01 Jul 2026 20:39:51 GMT]]></title><description><![CDATA[<p><span><a href="/user/algernon%40come-from.mad-scientist.club">@<span>algernon</span></a></span> A wonderful understatement. Perfect answer <img src="https://forum.fedi.dk/assets/plugins/nodebb-plugin-emoji/emoji/android/1f642.png?v=7979fdcf9c7" class="not-responsive emoji emoji-android emoji--slightly_smiling_face" style="height:23px;width:auto;vertical-align:middle" title=":)" alt="🙂" /> <span><a href="https://mastodon.roflcopter.fr/@b_b">@<span>b_b</span></a></span> <span><a href="/user/grumpybozo%40toad.social">@<span>grumpybozo</span></a></span> <span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span></p>]]></description><link>https://forum.fedi.dk/post/https://social.wildeboer.net/users/jwildeboer/statuses/116846650395992745</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://social.wildeboer.net/users/jwildeboer/statuses/116846650395992745</guid><dc:creator><![CDATA[jwildeboer@social.wildeboer.net]]></dc:creator><pubDate>Wed, 01 Jul 2026 20:39:51 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. 
on Wed, 01 Jul 2026 20:38:51 GMT]]></title><description><![CDATA[<p><span><a href="https://mastodon.roflcopter.fr/@b_b" rel="nofollow noreferrer noopener">@<span>b_b</span></a></span> <span><a href="/user/grumpybozo%40toad.social" rel="nofollow noreferrer noopener">@<span>grumpybozo</span></a></span> <span><a href="/user/alan%40lighthouse.co.im" rel="nofollow noreferrer noopener">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net" rel="nofollow noreferrer noopener">@<span>jwildeboer</span></a></span> I've been using this trick (+ a few tweaks) for about a year now, with <a href="https://iocaine.madhouse-project.org/" rel="nofollow noreferrer noopener">iocaine</a>, with great success.</p>]]></description><link>https://forum.fedi.dk/post/https://come-from.mad-scientist.club/users/algernon/statuses/01KWFPFT6TM7T8K0C28V0M2W91</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://come-from.mad-scientist.club/users/algernon/statuses/01KWFPFT6TM7T8K0C28V0M2W91</guid><dc:creator><![CDATA[algernon@come-from.mad-scientist.club]]></dc:creator><pubDate>Wed, 01 Jul 2026 20:38:51 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Wed, 01 Jul 2026 19:21:32 GMT]]></title><description><![CDATA[<p><span><a href="/user/algernon%40come-from.mad-scientist.club">@<span>algernon</span></a></span> <span><a href="/user/grumpybozo%40toad.social">@<span>grumpybozo</span></a></span> <span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> Did you try that trick? 
Which tool did you use, and did it work well?</p>]]></description><link>https://forum.fedi.dk/post/https://mastodon.roflcopter.fr/users/b_b/statuses/116846342501667642</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://mastodon.roflcopter.fr/users/b_b/statuses/116846342501667642</guid><dc:creator><![CDATA[b_b@mastodon.roflcopter.fr]]></dc:creator><pubDate>Wed, 01 Jul 2026 19:21:32 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Tue, 30 Jun 2026 10:49:22 GMT]]></title><description><![CDATA[<p><span><a href="https://en.osm.town/@harry_wood">@<span>harry_wood</span></a></span> The scrapers that hammer my server aren't using circumvention tactics and are, in fact, very stupid ones. What makes them hard to block is that they come from all over the world, with unique IP addresses that have no clearly identifiable origin. That's the "residential IP proxy" effect. They make a few requests and disappear again, so rate limiting doesn't catch them. <span><a href="/user/timwardcam%40c.im">@<span>TimWardCam</span></a></span></p>]]></description><link>https://forum.fedi.dk/post/https://social.wildeboer.net/users/jwildeboer/statuses/116838666215720036</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://social.wildeboer.net/users/jwildeboer/statuses/116838666215720036</guid><dc:creator><![CDATA[jwildeboer@social.wildeboer.net]]></dc:creator><pubDate>Tue, 30 Jun 2026 10:49:22 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. 
on Tue, 30 Jun 2026 09:59:51 GMT]]></title><description><![CDATA[<p><span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> <span><a href="/user/timwardcam%40c.im">@<span>TimWardCam</span></a></span> Yes. This was puzzling me. Surely the big AI providers, OpenAI, Google, etc, wouldn't want to damage their brand by operating scrapers so incompetently.</p><p>But no. It's not them. The scraperpocalypse coincides with the arrival of LLMs *partly* because of increased demand for data sets, but partly just because LLM coding enables vast armies of script kiddies to easily develop scrapers that use circumvention tactics.</p>]]></description><link>https://forum.fedi.dk/post/https://en.osm.town/users/harry_wood/statuses/116838471548911060</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://en.osm.town/users/harry_wood/statuses/116838471548911060</guid><dc:creator><![CDATA[harry_wood@en.osm.town]]></dc:creator><pubDate>Tue, 30 Jun 2026 09:59:51 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. 
on Tue, 30 Jun 2026 07:04:00 GMT]]></title><description><![CDATA[<p><span><a href="/user/damonhd%40mastodon.social">@<span>DamonHD</span></a></span> <span><a href="/user/nxadm%40infosec.exchange">@<span>nxadm</span></a></span> <span><a href="/user/koen_hufkens%40mastodon.social">@<span>koen_hufkens</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> Not so much compromise as shipped with the device.</p>]]></description><link>https://forum.fedi.dk/post/https://infosec.exchange/users/dascandy/statuses/116837780068649537</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://infosec.exchange/users/dascandy/statuses/116837780068649537</guid><dc:creator><![CDATA[dascandy@infosec.exchange]]></dc:creator><pubDate>Tue, 30 Jun 2026 07:04:00 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Tue, 30 Jun 2026 06:57:40 GMT]]></title><description><![CDATA[<p><span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> The residential proxy is the *industrialized* scale use of smart TVs to host a proxy for companies to use to redirect requests through, so it's actually most people's regular TVs that are attacking you.</p><p>Which also means that if you block any of them, you're cutting off actual users too. 
Residences have one IP, and most users don't even know they're hosting a proxy for companies to lease.</p>]]></description><link>https://forum.fedi.dk/post/https://infosec.exchange/users/dascandy/statuses/116837755178556468</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://infosec.exchange/users/dascandy/statuses/116837755178556468</guid><dc:creator><![CDATA[dascandy@infosec.exchange]]></dc:creator><pubDate>Tue, 30 Jun 2026 06:57:40 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Tue, 30 Jun 2026 02:25:07 GMT]]></title><description><![CDATA[<p><span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> At work, I've simply begun blocking /8s at the firewall. It's easier and actually causes less collateral damage than one might assume at this point.</p>]]></description><link>https://forum.fedi.dk/post/https://infosec.exchange/users/mgd81/statuses/116836683462584255</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://infosec.exchange/users/mgd81/statuses/116836683462584255</guid><dc:creator><![CDATA[mgd81@infosec.exchange]]></dc:creator><pubDate>Tue, 30 Jun 2026 02:25:07 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. 
on Mon, 29 Jun 2026 22:58:04 GMT]]></title><description><![CDATA[<p><span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> <span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> <br />I suppose one could rate limit the site's outward traffic, without reference to where it is going ...<br />And then have a very long list of specific addresses which get an extra rate.<br />And that list, which would be like a Squid access list, could be shared in some sections among quite a lot of sites.</p>]]></description><link>https://forum.fedi.dk/post/https://mastodon.social/users/Photo55/statuses/116835869287072521</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://mastodon.social/users/Photo55/statuses/116835869287072521</guid><dc:creator><![CDATA[photo55@mastodon.social]]></dc:creator><pubDate>Mon, 29 Jun 2026 22:58:04 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. 
on Mon, 29 Jun 2026 22:56:28 GMT]]></title><description><![CDATA[<p><span><a href="/user/dzwiedziu%40mastodon.social">@<span>dzwiedziu</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> Web 4.0, invite only with lobste.rs style reputation system.</p>]]></description><link>https://forum.fedi.dk/post/https://cyberplace.social/users/chrisp/statuses/116835863022959790</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://cyberplace.social/users/chrisp/statuses/116835863022959790</guid><dc:creator><![CDATA[chrisp@cyberplace.social]]></dc:creator><pubDate>Mon, 29 Jun 2026 22:56:28 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Mon, 29 Jun 2026 21:48:21 GMT]]></title><description><![CDATA[<p><span><a href="/user/jwildeboer%40social.wildeboer.net" rel="nofollow noopener">@<span>jwildeboer</span></a></span> <span><a href="/user/alan%40lighthouse.co.im" rel="nofollow noopener">@<span>alan</span></a></span> </p><p>I have a proposal: <a href="https://wooz.dev/BotID" rel="nofollow noopener">BotID</a></p>]]></description><link>https://forum.fedi.dk/post/https://toot.cat/users/woozle/statuses/116835595126444432</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://toot.cat/users/woozle/statuses/116835595126444432</guid><dc:creator><![CDATA[woozle@toot.cat]]></dc:creator><pubDate>Mon, 29 Jun 2026 21:48:21 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. 
on Mon, 29 Jun 2026 20:21:41 GMT]]></title><description><![CDATA[<p><span><a href="/user/agturcz%40circumstances.run">@<span>agturcz</span></a></span> <span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> </p><p>Apparently many ”smart” TV manufacturers ship proxy SDKs from companies like Bright, and they turn the TVs into nodes in a botnet that is used for ”AI” data scraping, so the traffic comes from all over the place. </p><p>I’d guess not many consumers know about it, let alone have the technical know-how to prevent it.</p>]]></description><link>https://forum.fedi.dk/post/https://hachyderm.io/users/rytmis/statuses/116835254401595960</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://hachyderm.io/users/rytmis/statuses/116835254401595960</guid><dc:creator><![CDATA[rytmis@hachyderm.io]]></dc:creator><pubDate>Mon, 29 Jun 2026 20:21:41 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Mon, 29 Jun 2026 18:55:39 GMT]]></title><description><![CDATA[<p><span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> <span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> It's sad that we can't use MAC address-based filtering on the IoT client devices themselves. 
All of them reserve blocks of MAC addresses, usually from the NIC manufacturer's block, where it would be easy to block all traffic from "Samsung TV sets" or "Roku Devices" or "Apple TV".</p>]]></description><link>https://forum.fedi.dk/post/https://mastodon.social/users/jab01701mid/statuses/116834916043164153</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://mastodon.social/users/jab01701mid/statuses/116834916043164153</guid><dc:creator><![CDATA[jab01701mid@mastodon.social]]></dc:creator><pubDate>Mon, 29 Jun 2026 18:55:39 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Mon, 29 Jun 2026 18:42:48 GMT]]></title><description><![CDATA[<p><span><a href="/user/dalias%40hachyderm.io">@<span>dalias</span></a></span> <span><a href="/user/rupert%40mastodon.nz">@<span>rupert</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> People were doing this sort of thing before the AI garbage.</p><p>One place I worked, several years ago now, there was a performance problem.</p><p>"We'd better fix that by tweaking the cloud autoscaling parameters" they said.</p><p>FFS. 
I had a look at the actual code and made it go several times faster.</p>]]></description><link>https://forum.fedi.dk/post/https://c.im/users/TimWardCam/statuses/116834865570973380</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://c.im/users/TimWardCam/statuses/116834865570973380</guid><dc:creator><![CDATA[timwardcam@c.im]]></dc:creator><pubDate>Mon, 29 Jun 2026 18:42:48 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Mon, 29 Jun 2026 18:39:29 GMT]]></title><description><![CDATA[<p><span><a href="/user/grumpybozo%40toad.social" rel="nofollow noreferrer noopener">@<span>grumpybozo</span></a></span> <span><a href="/user/alan%40lighthouse.co.im" rel="nofollow noreferrer noopener">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net" rel="nofollow noreferrer noopener">@<span>jwildeboer</span></a></span> FWIW, you can still mitigate most of them if you look at headers other than the user agent.</p><p>Many of the crawlers that try to disguise themselves as real browsers utterly fail to send the headers those browsers would send, such as <code>sec-fetch-mode</code> on any HTTPS request.</p><p>With few exceptions, if the UA contains <code>Chrome/</code> or <code>Firefox/</code>, and the request doesn't have a <code>sec-fetch-mode</code> header, it is almost certainly a crawler.</p><p>I've been successfully mitigating pretty much all of them for about a year now (from ~100 million requests/day down to 3 million, the majority of which are served garbage).</p>]]></description><link>https://forum.fedi.dk/post/https://come-from.mad-scientist.club/users/algernon/statuses/01KWAAVT9ZQ0SJNBPWW6KCV3VD</link><guid 
isPermaLink="true">https://forum.fedi.dk/post/https://come-from.mad-scientist.club/users/algernon/statuses/01KWAAVT9ZQ0SJNBPWW6KCV3VD</guid><dc:creator><![CDATA[algernon@come-from.mad-scientist.club]]></dc:creator><pubDate>Mon, 29 Jun 2026 18:39:29 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Mon, 29 Jun 2026 18:39:16 GMT]]></title><description><![CDATA[<p><span><a href="/user/rupert%40mastodon.nz">@<span>rupert</span></a></span> <span><a href="/user/timwardcam%40c.im">@<span>TimWardCam</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> Exactly. This is *ideological* - they deem actually-engineered solutions that do things efficiently as a backwards "dirty human" way of doing things. Obviously since their AI slop is superior, they should do the scraping in whatever way the AI slop vomits out code to do it.</p>]]></description><link>https://forum.fedi.dk/post/https://hachyderm.io/users/dalias/statuses/116834851630372436</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://hachyderm.io/users/dalias/statuses/116834851630372436</guid><dc:creator><![CDATA[dalias@hachyderm.io]]></dc:creator><pubDate>Mon, 29 Jun 2026 18:39:16 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. 
on Mon, 29 Jun 2026 18:36:57 GMT]]></title><description><![CDATA[<p><span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> Yes, it is entirely Apple's and Google's fault that they are hosting botnet malware in their "walled gardens" as legitimate and vetted software. Without that, the botnets would not exist on any viable scale.</p>]]></description><link>https://forum.fedi.dk/post/https://hachyderm.io/users/dalias/statuses/116834842542173381</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://hachyderm.io/users/dalias/statuses/116834842542173381</guid><dc:creator><![CDATA[dalias@hachyderm.io]]></dc:creator><pubDate>Mon, 29 Jun 2026 18:36:57 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Mon, 29 Jun 2026 16:02:05 GMT]]></title><description><![CDATA[<p><span><a href="/user/alan%40lighthouse.co.im" rel="nofollow noopener">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net" rel="nofollow noopener">@<span>jwildeboer</span></a></span> </p><p>It is <em>all</em> an accountability problem. 
<img src="https://forum.fedi.dk/assets/plugins/nodebb-plugin-emoji/emoji/android/1f937.png?v=7979fdcf9c7" class="not-responsive emoji emoji-android emoji--shrug" style="height:23px;width:auto;vertical-align:middle" title="🤷" alt="🤷" /><img src="https://forum.fedi.dk/assets/plugins/nodebb-plugin-emoji/emoji/android/1f3fb.png?v=7979fdcf9c7" class="not-responsive emoji emoji-android emoji--skin-tone-2" style="height:23px;width:auto;vertical-align:middle" title="🏻" alt="🏻" />‍<img src="https://forum.fedi.dk/assets/plugins/nodebb-plugin-emoji/emoji/android/2642.png?v=7979fdcf9c7" class="not-responsive emoji emoji-android emoji--male_sign" style="height:23px;width:auto;vertical-align:middle" title="♂" alt="♂" />️</p>]]></description><link>https://forum.fedi.dk/post/https://infosec.exchange/users/avuko/statuses/116834233606780619</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://infosec.exchange/users/avuko/statuses/116834233606780619</guid><dc:creator><![CDATA[avuko@infosec.exchange]]></dc:creator><pubDate>Mon, 29 Jun 2026 16:02:05 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Mon, 29 Jun 2026 15:56:06 GMT]]></title><description><![CDATA[<p><span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> The attacks have thwarted all of those tactics. They use UAs constructed from real UA tokens with minor variations. They have graduated from cheap VMs on Huawei Cloud and Digital Ocean to random IoTs in millions of households and mobile devices in millions of hands. <br />A few days ago I was able to measure over a thousand simultaneous sessions, each from a different /16 network. 
<br />My response to that isn’t taking the site down, but I am shedding load aggressively.</p>]]></description><link>https://forum.fedi.dk/post/https://toad.social/users/grumpybozo/statuses/116834210049185518</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://toad.social/users/grumpybozo/statuses/116834210049185518</guid><dc:creator><![CDATA[grumpybozo@toad.social]]></dc:creator><pubDate>Mon, 29 Jun 2026 15:56:06 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. on Mon, 29 Jun 2026 14:50:58 GMT]]></title><description><![CDATA[<p><span><a href="https://mastodon.social/@LeonardoDiOttio">@<span>LeonardoDiOttio</span></a></span> <span><a href="/user/alan%40lighthouse.co.im">@<span>alan</span></a></span> <span><a href="/user/jwildeboer%40social.wildeboer.net">@<span>jwildeboer</span></a></span> it's not only that kind of stuff, fwiw - go look up bright data's sdk (for example), then do some speculative math on how many people are out there with phones that are full of apps with that sort of shit in it (there's more than only bright-sdk out there)</p>]]></description><link>https://forum.fedi.dk/post/https://mastodon.social/users/froztbyte/statuses/116833953953966518</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://mastodon.social/users/froztbyte/statuses/116833953953966518</guid><dc:creator><![CDATA[froztbyte@mastodon.social]]></dc:creator><pubDate>Mon, 29 Jun 2026 14:50:58 GMT</pubDate></item><item><title><![CDATA[Reply to My timeline (which contains a lot of project leaders&#x2F;sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. 
on Mon, 29 Jun 2026 14:20:53 GMT]]></title><description><![CDATA[<p><span><a href="/user/maswan%40mastodon.acc.sunet.se">@<span>maswan</span></a></span> Ouch <img src="https://forum.fedi.dk/assets/plugins/nodebb-plugin-emoji/emoji/android/1f61e.png?v=7979fdcf9c7" class="not-responsive emoji emoji-android emoji--disappointed" style="height:23px;width:auto;vertical-align:middle" title=":(" alt="😞" /></p>]]></description><link>https://forum.fedi.dk/post/https://social.wildeboer.net/users/jwildeboer/statuses/116833835678758547</link><guid isPermaLink="true">https://forum.fedi.dk/post/https://social.wildeboer.net/users/jwildeboer/statuses/116833835678758547</guid><dc:creator><![CDATA[jwildeboer@social.wildeboer.net]]></dc:creator><pubDate>Mon, 29 Jun 2026 14:20:53 GMT</pubDate></item></channel></rss>