https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...
Good job perusing it tho, that's fantastic. (ps, big fan of your product, great work on that too!)
It's one thing if a company ignores robots.txt and causes serious interference with the service, as Perplexity was doing, but the details here don't really add up: this company didn't have a robots.txt in place, and although the article mentions tens or hundreds of thousands of requests, it doesn't say anything about them being made unreasonably quickly.
The default-public accessibility of information on the internet is a net-good for the technology ecosystem. Want to host things online? Learn how.
EDIT: They're a very media-heavy website. Here's one of the product pages from their catalog: https://triplegangers.com/browse/scans/full-body/sara-liang-.... Each of the body-pose images is displayed at about 35x70px but is served as a 500x1000px image. It now seems like they have some cloudflare caching in place at least.
I stand by my belief that unless we get some evidence that they were being scraped particularly aggressively, this is on them, and this is being blown out of proportion for publicity.
Learning how is sometimes actually learning who's going to get you online in a good way.
In this case, when you have non-tech people building WordPress sites, it's about what they can understand and do, and the rate of learning doesn't always keep up relative to client work.
> The default-public accessibility of information on the internet is a net-good for the technology ecosystem. Want to host things online? Learn how.
These two statements are at odds, I hope you realize. You say public accessibility of information is a good thing, while blaming someone for being effectively DDOS'd as a result of having said information public.
The clickbaity hysteria here is missing out how this sort of scraping has been possible long before AI agents showed up a couple of years back.
[0] https://web.archive.org/web/20221206134212/https://www.tripl...
Publicly publishing information for others to access and then complaining that ~1 rps takes your site down is not sympathetic. I don't know what the actual numbers and rates are because they weren't reported, but the fact that they weren't reported leads me to assume they're just trying to get some publicity.
They publicly published the site for their customers to browse, with the side benefit that curious people could also use the site in moderation since it wasn't affecting them in any real way. OpenAI isn't their customer, and their use is affecting them in terms of hosting costs and lost revenue from downtime.
The obvious next step is to gate that data behind a login, and now we (the entire world) all have slightly less information at our fingertips because OpenAI did what they do.
The point is that OpenAI, or anyone doing massive scraping ops, should know better by now. Sure, the small company that doesn't do web design had a single file misconfigured, but that shouldn't be a four- or five-figure mistake. OpenAI knows what bandwidth costs. There should be a mechanism that says: hey, we have asked for many gigabytes or terabytes of data from a single domain scrape, that is a problem.
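Something like that check is trivial to express; a rough sketch of the shape of it (the budget number and names are made up, not anything OpenAI actually runs):

    # Hypothetical per-domain byte budget for a crawler; numbers are illustrative.
    from collections import defaultdict
    from urllib.parse import urlparse

    DOMAIN_BYTE_BUDGET = 2 * 1024**3  # stop after roughly 2 GB from one domain

    bytes_per_domain = defaultdict(int)

    def still_within_budget(url: str, response_size: int) -> bool:
        """Record bytes fetched from this URL's domain and report whether
        crawling that domain should continue."""
        domain = urlparse(url).netloc
        bytes_per_domain[domain] += response_size
        return bytes_per_domain[domain] < DOMAIN_BYTE_BUDGET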
If I stock a Little Free Library at the end of my driveway, it's because I want people in the community to peruse and swap the books in a way that's intuitive to pretty much everyone who might encounter it.
I shouldn't need to post a sign outside of it saying "Please don't just take all of these at once", and it'd be completely reasonable for me to feel frustrated if someone did misuse it -- regardless of whether the sign was posted or not.
Just because something is technically possible and not illegal does NOT make it the right thing to do.
There's no chance every single website in existence is going to have a flawless setup. That's guaranteed simply from the number of websites, and how old some of them are.
It's the first sentence of the article.
On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down.
If a scraper is making enough requests to take someone else's website down, the scraper's requests are being made unreasonably quickly.
They really are trying to burn all their goodwill to the ground with this stuff.
But then again, if you’re in the cloud, egress bandwidth is going to cost you for playing this game.
Better to just deny the OpenAI crawler and send them an invoice for the money and time they’ve wasted. Interesting form of data warfare against competitors and non-competitors alike. The winner will have the longest runway.
couple of lines in your nginx/apache config and off you go
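For anyone who wants the couple of lines, here's a rough nginx sketch that refuses OpenAI's published user agents (GPTBot, ChatGPT-User, OAI-SearchBot); adjust the pattern to whatever actually shows up in your logs, and the apache equivalent is a RewriteCond on %{HTTP_USER_AGENT}:

    # Goes in the http {} block: tag requests from OpenAI's documented crawlers.
    map $http_user_agent $block_ai_bot {
        default                                  0;
        "~*(GPTBot|ChatGPT-User|OAI-SearchBot)"  1;
    }

    server {
        # ... existing server config ...
        if ($block_ai_bot) {
            return 403;
        }
    }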
my content rich sites provide this "high quality" data to the parasites
most insightful, thank you! also, stay away from linkedin, you sweet summer child.
We were told at that time that the "robots.txt" enforcement was the one thing they had that wasn't fully distributed; it's a devilishly difficult thing to implement.
It boggles my mind that people with the kind of budget that some of these people have are still struggling to implement crawling right 20 years later, though. It's nice those folks got a rebate.
One of the reasons people are testy today is that you pay by the GB w/ cloud providers; about 10 years ago I kicked out the sinosphere crawlers like Baidu because they were generating something like 40% of the traffic on my site, crawling over and over again and not sending even a single referrer.
- they don't respect the Crawl-Delay directive
- google search console reports 429s as 500s
https://developers.google.com/search/docs/crawling-indexing/...
At the end of the day, the claim is that someone's action caused someone else undue financial burden in a way that is not easily prevented beforehand, so I wouldn't say it's a 100% clear case, but I'm also not sure a judge wouldn't entertain it.
If it were possible, someone would have done it by now. It hasn't happened because robots.txt has absolutely no legal weight whatsoever. It's entirely voluntary, which means it's perfectly legal not to volunteer.
But if you or anyone else wants to waste their time tilting at legal windmills, have fun ¯\_(ツ)_/¯.
The suit itself is the mechanism for determining whether the harm existed.
And yes, of course, this presents much opportunity for abuse.
Edit 1: I'm surprised by the bandwidth costs. I use hetzner and OVH and the bandwidth is free. Though you manage the bare metal server yourself. Would readthedocs ever consider switching to self-managed hosting to save costs on cloud hosting?
Just follow the golden rule: don’t ever load any site more aggressively than you would want yours to be.
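In crawler code the golden rule amounts to a few lines; a minimal sketch (the delay, user agent string, and robots.txt handling here are illustrative, not any particular crawler's policy):

    # Minimal "polite fetch" sketch: honour robots.txt, space out requests,
    # and back off when the server answers 429. Values are illustrative.
    import time
    import urllib.robotparser

    import requests

    USER_AGENT = "example-polite-crawler"  # hypothetical name
    REQUEST_DELAY = 2.0                    # seconds between successive requests

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()

    def polite_get(url):
        if not robots.can_fetch(USER_AGENT, url):
            return None                    # the site asked not to be fetched
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        if resp.status_code == 429:        # server says slow down, so slow down
            time.sleep(60)
        time.sleep(REQUEST_DELAY)          # don't hammer the host
        return resp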
This isn’t hard stuff, and these AI companies have grossly inefficient and obnoxious scrapers.
As a site owner this pisses me off as a matter of decency on the web, but as an engineer doing distributed data collection I’m offended by how shitty and inefficient their crawlers are.
They do. Marc Andreessen said as much in his "techno-optimist manifesto," that any hesitation or slowdown in AI development or adoption is equivalent to mass murder.
It’s a waste of bandwidth and CPU on their end as well, “the bitter lesson” isn’t “keep duplicating the same training data”.
I’m glad DeepSeek is showing how inefficient and dogshit most of the frontier model engineering is - how much VC is getting burned literally redownloading a copy of the entire web daily when <1% of it is new data.
I get they have no shame economically, that they are deluded and greedy. But bad engineering is another class of sin!
In years of running webcrawlers I've had very little trouble, I've had more trouble in the last year than in the past 25. (Wrote my first crawler in '99, funny my crawlers have gotten simpler over time not more complex)
In one case I found a site got terribly slow although I was hitting it at much less than 1 request per second. Careful observation showed the wheels were coming off the site and it had nothing to do with me.
There's another site that I've probably crawled in its entirety at least ten times over the past twenty years. I have a crawl from two years ago; my plan was to feed it into a BERT-based system, not for training but to discover content that is like the content that I like. I thought I'd get a fresh copy w/ httrack (polite, respects robots.txt, ...) and they blocked both my home IP addresses in 10 minutes. (Granted, I don't think the past 2 years of this site was as good as the past, so I will just load what I have into my semantic search & tagging system and use that instead)
I was angry about how unfair the Google Economy was in 2013, in line with what this blogger has been saying ever since
(I can say it's a strange way to market an expensive SEO community but...) and it drives me up the wall that people looking in the rear view mirror are getting upset about it now.
Back in '98 I was excited about "personal webcrawlers" that could be your own web agent. On one hand LLMs could give so much utility in terms of classification, extraction, clustering and otherwise drinking from that firehose but the fear that somebody is stealing their precious creativity is going to close the door forever... And entrench a completely unfair Google Economy. It makes me sad.
----
Oddly those stupid ReCAPTCHAs and Cloudflare CAPTCHAs torment me all the time as a human but I haven't once had them get in the way of a crawling project.
>“We’re in a business where the rights are kind of a serious issue, because we scan actual people,” he said. With laws like Europe’s GDPR, “they cannot just take a photo of anyone on the web and use it.”
Yes, and protecting that data was your responsibility, Tomchuk. You dropped the ball and are now trying to blame the other players.
Or is that still my fault somehow?
Maybe we should stop blaming people for "letting" themselves get destroyed and maybe put some blame on the people actively choosing to behave in a way that harms everyone else?
But then again, they have so much money so we should all just bend over and take it, right?
As for bending over, if you serve files and they request files, then you send them files, what exactly is the problem? That you didn't implement any kind of rate limiting? It's a web-based company and these things are just the basics.
"As Tomchuk experienced, if a site isn’t properly using robot.txt, OpenAI and others take that to mean they can scrape to their hearts’ content."
The takeaway: check your robots.txt (example below).
The question of how much load robots can reasonably generate when they are allowed is a separate matter.
All this effort is futile because AI bots will simply send false user agents, but it's something.
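For reference, the opt-out OpenAI documents is a couple of lines of robots.txt (GPTBot is their training crawler, ChatGPT-User the user-triggered fetcher):

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /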
https://platform.openai.com/docs/bots
It does not document how the bot will respond to things like 429 response codes. It does not document how frequently their bots may crawl a given domain/url. It does not document any way to get in touch about abusive crawler practices.
It could also be an interesting dataset for exposing the IPs those shady "anonymous scraping" comp intel companies use..
The point of 429 is that you will not be using up your limited bandwidth sending the actual response, which will save you at least 99% of your bandwidth quota. It is not to find IPs to block, especially if the requestor gives up after a few requests.
The IPs that you actually need to block are the ones that are actually DoSing you without stopping even after a few retries, and even then only temporarily.
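Serving a bare 429 is also cheap to set up; a rough nginx sketch with made-up numbers that rate-limits per client IP and answers the overflow with an empty 429 instead of the full response (if you're behind a proxy or CDN, key on the real client IP instead):

    # In the http {} block: ~5 requests/second per client IP, small burst allowed.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    server {
        # ... existing server config ...
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;   # overflow gets a bare 429, not the expensive page
    }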
The IP addresses in the screenshot are all owned by Cloudflare, meaning that their server logs are only recording the IPs of Cloudflare's reverse proxy, not the real client IPs.
Also, the logs don't show any timestamps and there doesn't seem to be any mention of the request rate in the whole article.
I'm not trying to defend OpenAI, but as someone who scrapes data I think it's unfair to throw around terms like "DDoS attack" without providing basic request rate metrics. This seems to be purely based on the use of multiple IPs, which was actually caused by their own server configuration and has nothing to do with OpenAI.
How about this: these folks put up a website in order to serve customers, not for OpenAI to scoop up all their data for their own benefit. In my opinion data should only be made available to "AI" companies on an opt-in basis, but given today's reality OpenAI should at least be polite about how they harvest data.
That and the fact that they're using a log file with the timestamps omitted as evidence of "how ruthlessly an OpenAI bot was accessing the site" makes the claims in the article a bit suspect.
OpenAI isn't necessarily in the clear here, but this is a low-quality article that doesn't provide much signal either way.
AI companies cause most of traffic on forums - https://news.ycombinator.com/item?id=42549624 - Dec 2024 (438 comments)
Now everybody calls this abuse. And a lot of it is abuse, to be fair.
Now that has been mostly blocked. Every website tries really hard to block bots (and mostly fails, because Google puts millions of dollars into their crawler while companies raise a stink over paying a single SWE), but it's still at the point that automated interactions with companies (through third-party services, for example) are not really possible. I cannot give my credit card info to a company and have it order my favorite foods to my home every day, for example.
What AI promises, in a way, is to re-enable this. Because AI bots are unblockable (they're more human than humans as far as these tests are concerned). For companies, and for users. And that would be a way to ... put APIs into people and companies again.
Back to step 1.
Yes, VCs want this because it's an opportunity for a double-sided marketplace, but I still want it too.
I wonder to what extent what these FANG businesses want with AI can be described as just "an API into businesses that don't want to provide an API".
It's time to level up in this arms race. Let's stop delivering HTML documents and instead use animated rendering of information positioned in a scene, so that the user has to move elements around for it to be recognizable, like a full-site CAPTCHA. It doesn't need to be overly complex for a user who can intuitively navigate even a 3D world, but it will take 1000x more processing for OpenAI. Feel free to come up with your own creative designs to make automation more difficult.
It seems that any router or switch over 100G is extremely expensive, and often requires some paid-for OS.
The pro move would be to not block these bots. Well, I guess block them if you truly can't handle their request throughput (would an ASN blacklist work?).
Or, if you want to force them to slow down, start sending data but only answer a random percentage of their requests (say, ignore 85% of the traffic they spam you with and reply to the rest at a super low rate, or purposely send bad data; rough sketch below).
Or perhaps reach out to your peering partners and talk about traffic shaping these requests.
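A minimal nginx sketch of the "ignore most of it" half of that idea; the user-agent pattern and the percentage are illustrative, not a recommendation:

    # Tag requests from OpenAI's documented crawlers (illustrative pattern).
    map $http_user_agent $block_ai_bot {
        default                                  0;
        "~*(GPTBot|ChatGPT-User|OAI-SearchBot)"  1;
    }

    # Deterministically sort tagged requests into "serve" / "drop" buckets (~85% dropped).
    split_clients "${remote_addr}${request_uri}" $bot_roll {
        15%  "serve";
        *    "drop";
    }

    map "$block_ai_bot:$bot_roll" $tarpit {
        default   0;
        "1:drop"  1;
    }

    server {
        # ... existing server config ...
        if ($tarpit) {
            return 444;  # close the connection without sending a response
        }
    }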