By design, public pages on Pinboard are supposed to be visible to anyone, whether or not they have an account on the site. However, since about August of 2024 I have found myself overwhelmed with bot traffic from crawlers, which has forced me to put all pages on the site behind a login.
In the past, it was relatively easy to stop aggressive crawling by blocking on IP range or on some feature of the user agent string. This crawling is different—it comes from thousands of distinct IP addresses that make one or two requests each, and the user agent strings spoof normal browsers. Sampling this traffic shows it comes almost entirely from Hong Kong and mainland Chinese IP addresses. It averages about 1 request/second, although there are times when it can hit 4 requests/second or more.
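(By "sampling" I just mean counting requests per client address in the access log, along these lines; the log path and combined format here are whatever your nginx setup uses:)

  # requests per client IP, busiest first; a long tail of addresses
  # with one or two hits each is the signature of this kind of crawl
  awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -50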
The way Pinboard is designed, certain public pages (especially views of user bookmarks filtered by multiple tags) are expensive to generate. In ordinary circumstances, this is not an issue, but bot traffic spread across dozens of user+tag pages can quickly overwhelm the site, especially when the bots start paginating.
My question is how to effectively block this kind of distributed crawling on an Ubuntu box without relying on a third party like Cloudflare[0]. I understand that iptables is not designed to block tens of thousands of IP addresses or ranges efficiently. What options am I left with? Hiding public pages behind a captcha? Filtering the entire China IP range using rules loaded into a frontend like nginx (rough sketch after the footnote)?
[0] This restriction is a product requirement ("no third party anything"). You may think it's silly but bear with me; my users like it.
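To make that nginx option concrete, I imagine it would look roughly like this: the geo module compiles the range list into a prefix tree, so lookups stay cheap even with tens of thousands of entries. The file names and the bare 403 are placeholders, not anything I've actually deployed:

  # in the http {} context (e.g. /etc/nginx/conf.d/block-ranges.conf)
  geo $blocked {
      default 0;
      include /etc/nginx/cn-ranges.conf;   # lines like "203.0.113.0/24 1;"
  }

  # in the relevant server {} block
  if ($blocked) {
      return 403;
  }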
# send replies to one abusive IP into the loopback, so its TCP
# handshakes never complete (legacy net-tools syntax)
sudo route add X.X.X.X gw 127.0.0.1 lo
https://www.commandlinefu.com/commands/view/1767/number-of-o....
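If you want to script that against the connection counts from the link, a rough sketch (the >100 threshold is arbitrary, netstat is the deprecated net-tools version, and the cut mangles IPv6 addresses, so this is IPv4-only; ip route add blackhole is roughly the modern equivalent of the route trick above):

  # null-route every IPv4 address currently holding more than 100
  # open connections to this box
  netstat -ntu | awk 'NR > 2 {print $5}' | cut -d: -f1 |
    sort | uniq -c | awk '$1 > 100 {print $2}' |
    while read -r ip; do
      sudo ip route add blackhole "$ip/32"
    done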
You: "Block them with iptables?"
:P