I run Pinboard, a bookmarking website with about 20K active users.

By design, public pages on Pinboard are supposed to be visible to anyone, whether or not they have an account on the site. However, since about August of 2024 I have found myself overwhelmed with bot traffic from crawlers, which has forced me to put all pages on the site behind a login.

In the past, it was relatively easy to stop aggressive crawling by blocking on IP range or on some feature of the user agent string. This crawling is different—it comes from thousands of distinct IP addresses that make one or two requests each, and the user agent strings spoof normal browsers. Sampling this traffic shows it comes almost entirely from Hong Kong and mainland Chinese IP addresses. It averages about 1 request/second, although there are times when it can hit 4 requests/second or more.

The way Pinboard is designed, certain public pages (especially views of user bookmarks filtered by multiple tags) are expensive to generate. In ordinary circumstances, this is not an issue, but bot traffic spread across dozens of user+tag pages can quickly overwhelm the site, especially when the bots start paginating.

My question is how to effectively block this kind of distributed crawling on an Ubuntu box without relying on a third party like Cloudflare[0]. I understand that iptables is not designed to block tens of thousands of IP addresses or ranges efficiently. What options am I left with? Hiding public pages behind a captcha? Filtering the entire China IP range using rules loaded into a frontend like nginx?
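
For concreteness, the nginx variant would be something along these lines, using the stock geo module pointed at a file of CIDR ranges (the path and list format here are placeholders):

    # http context
    geo $block_cn {
        default   0;
        include   /etc/nginx/china_ranges.conf;   # one "CIDR 1;" entry per line
    }

    # server context
    if ($block_cn) {
        return 403;
    }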

[0] This restriction is a product requirement ("no third party anything"). You may think it's silly but bear with me; my users like it.

  • jsheard 12 hours ago |
    Dumb bulk crawlers usually don't bother running JavaScript, so you might be able to mitigate it by moving your expensive logic to an API endpoint that client-side JS calls to populate the page. Assuming you don't need search engines to be able to see those pages.
    • idlewords 12 hours ago |
      Thanks! That would certainly work at the cost of a little more backend complexity, which at this point seems a small price to pay.
  • devops000 12 hours ago |
    Blocking them with Cloudflare?
    • billybuckwheat 11 hours ago |
      He clearly said he didn't want to use Cloudflare, so that's not an option.
  • epc 11 hours ago |
    I block entire /8s and /16s from China on my personal sites. They were swamping my bandwidth requesting the same pages or images 1000s of times a day. Start by blocking the Tencent and Alibaba cloud networks, then work down through whichever other networks are generating the most traffic.
    • idlewords 10 hours ago |
      What do you use to do the blocking? Iptables or something else?
      • epc 9 hours ago |
        Depends on the box: for the cheaper sites I just use Apache deny lists; the beefier boxes have pf/pfSense installed, so I use that.
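
        A minimal sketch of each approach (ranges and paths are placeholders): an Apache 2.4 deny rule and a pf table loaded from a file.

        # Apache 2.4 (vhost config or .htaccess)
        <RequireAll>
            Require all granted
            Require not ip 203.0.113.0/24
        </RequireAll>

        # pf.conf
        table <blocklist> persist file "/etc/pf.blocklist"
        block drop in quick from <blocklist> to any
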
  • johng 10 hours ago |
    I used to null route large blocks as well. Let them sit and wait...

    route add X.X.X.X gw 127.0.0.1 lo
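
    (The iproute2 equivalent is a blackhole route; the prefix below is a placeholder:)

    ip route add blackhole 203.0.113.0/24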

  • johng 10 hours ago |
    This will help you sort IP addresses by number of open connections. Has been handy for me many times.

    https://www.commandlinefu.com/commands/view/1767/number-of-o....
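
    A common form of that kind of one-liner (exact flags vary by system; column 5 of netstat output is the remote address):

    netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head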

  • JSTrading 8 hours ago |
    Block them with iptables? Use something like this (not foolproof): https://herrbischoff.com/2021/03/herr-bischoffs-ip-blocklist... I like to have fun with these things, so I wrote a Docker image and my own C++ app (so it's fast) that randomly redirects them to a set of random pages I switch up now and again, sites like this or random news sites: https://www.planethollywoodlondon.com/
    • slater 8 hours ago |
      OP: "I understand that iptables is not designed to block tens of thousands of IP addresses or ranges efficiently. What options am I left with?"

      You: "Block them with iptables?"

      :P
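
      (For what it's worth, iptables does scale to big lists if the addresses live in an ipset hash set matched by a single rule; a minimal sketch, with placeholder set name and range:)

      ipset create cn_block hash:net maxelem 131072
      ipset add cn_block 203.0.113.0/24       # repeat per range, or bulk-load with "ipset restore"
      iptables -I INPUT -m set --match-set cn_block src -j DROP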

  • active_caramel 5 hours ago |
    Block access with the Tiananmen Square date; the idea is that the Great Firewall tends to cut connections carrying censored keywords. Google it and read up, might help.