How close is it? If I ran wireshark, would the bytes be exactly the same in the exact same packets?
It could mean that the packets are the same, but timing is off by a few milliseconds.
It could mean a single HTTP request exactly matches, but when doing two requests the real browser uses a connection pool but curl doesn't. Or uses HTTP/3's fast-open abilities, etc.
etc.
Identical here means having the same fingerprint - i.e. you could not write a function to reliably distinguish traffic from one or the other implementation (and if you can then that's a bug).
These can work well in some cases but it's always a tradeoff.
When I have to do HTTP requests these days, I default to a headless browser right away, because that seems to be the best bet. Even then, some website are not readable because they use captchas and whatnot.
Evade captchas. curl user agent / heuristics are blocked by many sites these days - I'd guess many popular CDNs have pre-defined "block bots" stuff that blocks everything automated that is not a well-known search engine indexer.
Headless browsers consume orders of magnitude more resources, and execute far more requests (e.g. fetching images) than a common webscraping job would require. Having run webscraping at scale myself, the cost of operating headless browsers made us only use them as a last resort.
How do you build that table and keep it up to date? Manually?
> I personally have never seen such approach and doubt its useful for many concerns.
It's an arms race and defenders are not keen on sharing their secret sauce, though I can't be the only one who thought of this rather basic bot characteristic, multiple abuse trams probably realized this decades ago. It works pretty well against the low-resource scrapers with fakes UA strings and all the right TLS handshakes. It won't work against the headless browsers that costs scrapers more in resources and bandwidth, and there are specific countermeasures for headless browsers [1], and counter-countermeasures. It's a cat and mouse game.
1. e.g. Mouse movement, as made famous as ine signal evaluated by Google's reCAPTCHA v2, monitor resolution & window size and position, and Canvas rendering, all if which have been gradually degraded by browser anti-fingerprinting efforts. The bot war is fought on the long tail.
Blind users also might have no use for the pictures, and another possibility is if the document is longer than the screen so the picture is out of view then the user might program the client software to use lazy loading, etc.
Why is this?
I can't help but feel like these are the dying breaths of the open Internet though. All the megacorps (Google, Microsoft, Apple, CloudFlare, et al) are doing their damndest to make sure everyone is only using software approved by them, and to ensure that they can identify you. From multiple angles too (security, bots, DDoS, etc.), and it's not just limited to browsers either.
End goal seems to be: prove your identity to the megacorps so they can track everything you do and also ensure you are only doing things they approve of. I think the security arguments are just convenient rationalizations in service of this goal.
I agree with the over zealous tracking by the megacorps but this is also due to bad actors, I work for a financial company and the amount of API abuse, ATO, DDoS, nefarious bot traffic, etc. we see on a daily basis is absolutely insane
And when it does get more dangerous, is over zealous tracking the best counter for this?
I've dealt with a lot of these threats as well, and a lot are countered with rather common tools, from simple fail2ban rules to application firewalls and private subnets and whatnot. E.g. a large fai2ban rule to just ban anything that attempts to HTTP GET /admin.php or /phpmyadmin etc, even just once, gets rid of almost all nefarious bot traffic.
So, I think the amount of attacks indeed can be insane. But the amount that need over zealous tracking is to be countered, is, AFAICS, rather small.
unfortunately fail2ban wouldn't even make a dent in the attack traffic hitting the endpoints in my day-to-day work, these are attackers utilizing residential proxy infrastructure that are increasingly capable of solving JS/client-puzzle challenges.. the arms race is always escalating
[img]https://example.com/phpmyadmin/whatever.png[/img]
While I don't have experience with a great number of WAFs I'm sure sophisticated ones let you be quite specific on where you are matching text to identify bad requests.
As an aside, another "easy win" is assuming any incoming HTTP request for a dotfile is malicious. I see constant unsolicitied attempts to access `.env`, for example.
In my case, I never run anything PHP so I'll just plain block out anything PHP (same for python, lua, activedirectory etc). And, indeed, .htaccess, .env etc. A rather large list of hardcoded stuff that gets an instant-ban. It drops the bot-traffic with 90% or more.
These obviously aren't targeted attacks. Protecting against those is another issue alltogether.
Today for example I changed energy company†. I made a telephone call, from a number the company has never seen before. I told them my name (truthfully but I could have lied) and address (likewise). I agreed to about five minutes of parameters, conditions, etc. and I made one actual meaningful choice (a specific tariff, they offer two). I then provided 12 digits identifying a bank account (they will eventually check this account exists and ask it to pay them money, which by default will just work) and I'm done.
Notice that anybody could call from a burner and that would work too. They could move Aunt Sarah's energy to some random outfit, assign payments to Jim's bank account, and cause maybe an hour of stress and confusion for both Sarah and Jim when months or years later they realise the problem.
We know how to do this properly, but it would be high friction and that's not in the interests of either the "energy companies" or the politicians who created this needlessly complicated "Free Market" for energy. We could abolish that Free Market, but again that's not in their interests. So, we're stuck with this waste of our time and money, indefinitely.
There have been simpler versions of this system, which had even worse outcomes. They're clumsier to use, they cause more people to get scammed AND they result in higher cost to consumers, so that's not great. And there are better systems we can't deploy because in practice too few consumers will use them, so you'd have 0% failure but lower total engagement and that's what matters.
† They don't actually supply either gas or electricity, that's a last mile problem solved by a regulated monopoly, nor do they make electricity or drill for gas - but they do bill me for the gas and electricity I use - they're an artefact of Capitalism.
All requests produced by those bots were valid ones, nothing that could be flagged by tools like fail2ban etc (my assumption is that it would be the same for financial systems).
Any blocking or rate limiting by IP is useless, we saw about 2-3 requests per minute per IP, and those actors had access to ridiculous number of large CIDRs, blocking any IP caused it instantly replace it with another.
blocking by AS number was also mixed bag, as this list growed up really quickly, most of that were registered to suspicious looking Gmail addresses. (I feel that such activity might own significant percentage of total ipv4 space)
This was basically cat and mouse game of finding some specific characteristic in requests that matches all that traffic and filtering it, but the other side would adapt next day or on Sunday.
aggregated amount of traffic was in range of 2-20k r/s to basically heaviest endpoint in the shop, with was the main reason we needed to block that traffic (it generated 20-40x load of organic traffic)
cloudflare was also not really successful with default configuration, we had to basically challenge everyone by default with whitelist of most common regions from where we expected customers.
So best solution is to track everyone and calculate long term reputation.
As stated before the main reason we needed to block it was volume of the traffic, you migh imagine identical scenario for dealing with DDoS attack.
That doesn't compute... Captcha is almost always used in such setups.
It also looks like you could just offer an API endpoint which would return if the article is in stock or not, or even provide a webhook. Why fight them? Just make the resource usage lighter.
I'm curious now though what the articles were, if you are at liberty to share?
Some bots checked product page where we had info if product is in stock (although they tried heavenly to bypass any caches by putting garbage in URL). This kind of bots also scaled instantly to thousands checkout requests when product become available with gave no time for auto scaling to react (this was another challenge here)
This was easy to mitigate so it didn't generate almost any load on the system.
I believe we had email notification available, but it could be too high latency way for them.
I'm not sure how much I can share about articles here, but I can say that those were fairly expensive (and limited series) wardrobe products.
Because you presumably want real, returning customers, and that means those customers need to get a chance at buying those products, instead of them being scooped up by a scalper the millisecond they appear on the website.
A time sensitive hash validating each request makes it a bit harder for them without significant extra work on your part. Address sensitive is much more effective but can result in issues for users that switch between networks (using your site on the move and passing between workers networks, for instance).
But that protection depends on the use case. And that in many of my use-cases, a simple f2b with a large hardcoded list of URL paths I guarantee to never have, will drop bot-traffic with 90% or more. The last 10% then split into "hits because the IP is new" and "other, more sophisticated bots". Bots, in those cases are mostly just stupid worms, just trying out known WP exploits, default passwords on often used tools (nextcloud, phpmyadmin, etc) and so on.
I've done something similar with a large list of known harvest/scraper bots, based on their user-agent (the nice ones), or their movements. Nothing complex, just things like "/hidden-page.html that's linked, but hidden with css/js.
And with spam bots, where certain post-requests can only come from repeatedly submitting the contact form.
This, obviously isn't going to give any protection against targeted attacks. Nor will it protect against more sophisticated bots. But in some -in my case, most- use-cases, it's enough to drop bot-traffic significantly.
You wind up having to use things like tls fingerprinting with other heuristics to identify what to traffic to reject. These all take engineering hours and require infrastructure. It is SO MUCH SIMPLER to require auth and reject everything else outright.
I know that the BigCo's want to track us and you originally mentioned tracking not auth. But my point is yeah, they have malicious reasons for locking things down, but there are legitimate reasons too.
...and we've circled back to the post's subject - a version of curl that impersonates browsers TLS handshake behavior to bypass such fingerprinting.
get /token
Returns token with timestamp in salted hash
get /resource?token=abc123xyz
Check for valid token and drop or deny.
But also, what you wrote is basically nonsense. Clients don't need "an approved cert authority". Nor are there any "approved gatekeepers", all major browsers are equally happy connecting to your Raspberry Pi as they are connecting to Cloudflare.
If you're fighting adversaries that go for scale, AKA trying to hack as many targets as possible, mostly low-sophistication, using techniques requiring 0 human work and seeing what sticks, yes, blocking those simple techniques works.
Those attackers don't ever expect to hack Facebook or your bank, that's just not the business they're in. They're fine with posting unsavory ads on your local church's website, blackmailing a school principal with the explicit pictures he stores on the school server, or encrypting all the data on that server and demanding a ransom.
If your company does something that is specifically valuable to someone, and there are people whose literal job it is to attack your company's specific systems, no, those simple techniques won't be enough.
If you're protecting a Church with 150 members, the simple techniques are probably fine, if you're working for a major bank or a retailer that sells gaming consoles or concert tickets, they're laughably inadequate.
I'm guessing investors actually like a healthy dose of open access and a healthy dose of defence. We see them (YC, as an example) betting on multiple teams addressing the same problem. The difference is their execution, the angle they attack.
If, say, the financial company you work for is capable in both product and technical aspect, I assume it leaves no gap. It's the main place to access the service and all the side benefits.
Sometimes the customer you have isn't the customer you want.
As a bank, you don't want the customers that will try to log in to 1000 accounts, and then immediately transfer any money they find to the Seychelles. As a ticketing platform, you don't want the customers that buy tickets and then immediately sell them on for 4x the price. As a messaging app, you don't want the customers who have 2000 bot accounts and use AI to send hundreds of thousands of spam messages a day. As a social network, you don't want the customers who want to use your platform to spread pro-russian misinformation.
In a sense, those are "customer needs left changing", but neither you nor otherr customers want those needs to be automatible.
Child safety, as always, was the sugar that made the medicine go down in freedom-loving USA. I imagine these states' approaches will try to move to the federal level after Section 230 dies an ignominious death.
Keep an eye out for Free Speech Coalition v. Paxton to hit SCOTUS in January: https://www.oyez.org/cases/2024/23-1122
These days I just tell friends & family to assume that nothing they do is private.
AI will replace any search you would want to do to find information, the only reason to scour the internet now is for social purposes: finding comments and forums or content from other users, and you don’t really need to be untracked to do all that.
A megacorp’s main motivation for tracking your identity is to sell you shit or sell your data to other people who want to sell you things. But if you’re using AI the amount of ads and SEO spam that you have to sift through will dramatically reduce, rendering most of those efforts pointless.
And most people aren’t using the internet like in the old days: stumbling across quaint cozy boutique websites made by hobbyists about some favorite topic. People just jump on social platforms and consume content until satisfied.
There is no money to be made anymore in mass web scraping at scale with impersonated clients, it’s all been consumed.
https://en.wikipedia.org/wiki/Next-Generation_Secure_Computi...
...and we're seeing the puzzle pieces fall into place. Mandated driver signing, TPMs, and more recently remote attestation. "Security" has always been the excuse --- securing their control over you.
For those less informed, add “to impersonate the fingerprints of a browser.”
One can, obviously, make requests without a browser stack.
Computers used to be empowering. Cryptography used to be empowering. Then these corporations started using both against us. They own the computers now. Hardware cryptography ensures the computers only run their software now, software that does their the corporation's bidding and enforces their controls. And if we somehow gain control of the computer we are denied every service and essentially ostracized. I don't think it will be long before we are banned from the internet proper for using "unauthorized" devices.
It's an incredibly depressing state of affairs. Everything the word "hacker" ever stood for is pretty much dying. It feels like there's no way out.
Absolutely. They might not care about individuals, though. It's their approach to shape "markets". The Apple, Google, Amazon, and Microsoft tax is not inevitable and that's their problem. They will fight toe and nail to keep you locked in, call it "innovation", and even cooperate with governments (which otherwise are their natural enemy in the fight for digital control). It's the people that a) don't care much and b) don't have any options.
In the end, a large share of our wealth is just pulled from us to these ever more ridiculous rent seeking schemes.
At the time I wrote this up, r1-api.rabbit.tech required TLS client fingerprints to match an expected value, and not much else: https://gist.github.com/DavidBuchanan314/aafce6ba7fc49b19206...
(I haven't paid attention to what they've done since so it might no longer be the case)
> I doubt sites without serious anti-bot detection will do TLS fingerprinting
They don't set it up themselves. CloudFlare offer such thing by default (?).
Is there a way to request impersonization of the current version of Chrome (or whatever)?
$ curl_chrome <TAB><TAB>
curl_chrome100
curl_chrome101
curl_chrome104
curl_chrome107
curl_chrome110
curl_chrome116
curl_chrome119
curl_chrome120
curl_chrome123
curl_chrome124
curl_chrome131
curl_chrome131_android
curl_chrome99
curl_chrome99_android
It's now pretty common to have cloudflare, AWS, etc WAFs as main endpoints, and these do anti bots (TLS fingerprinting, header fingerprinting, Javascript checks, capt has, etc).
Ultimately I was not able to get it to build because the BoringSSL disto it downloaded failed to build even though I made sure all of the dependencies the INSTALL.md listed are installed. This might be because the machine I was trying to build it on is an older Ubuntu 20 release.
Edit: Tried it on Ubuntu 22, but BoringSSL again failed to build. The make script did work better however, only requiring a single invocation of make chrome-build before blowing up.
Looks like a classic case of "don't ship -Werror because compiler warnings are unpredictable".
Died on:
/extensions.cc:3416:16: error: ‘ext_index’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
The good news is that removing -Werror from the CMakeLists.txt in BoringSSL got around that issue. Bad news is that the dependency list is incomplete. You will also need libc++-XX-dev and libc++abi-XX-dev where the XX is the major version number of GCC on your machine. Once you fix that it will successfully build, but the install process is slightly incomplete. It doesn't run ldconfig for you, you have to do it yourself.
On a final note, despite the name BoringSSL is huge library that takes a surprisingly long time to build. I thought it would be like LibreSSL where they trim it down to the core to keep the attack surface samll, but apparently Google went in the opposite direction.
The original repo was already full of hacks, and on top of that, I added more hacks to keep up with the latest browsers. The main purpose of my fork is to serve as a foundation of the python binding, which I think is easier to use. So I haven't tried to make the whole process more streamlined as long as it works on the CI. You can use the prebuilt binaries on the release page, though. I guess I should find some time to clean up the whole thing.
Worked around it by modifying the patch: https://github.com/jakeogh/jakeogh/blob/master/net-misc/curl...
Considering the complexity, this project, and it's upstream parent and grandparent(curl proper) are downright amazing.
The following browsers can be impersonated.
...unfortunately no Firefox to be seen.
I've had to fight this too, since I use a filtering proxy. User-agent discrimination should be illegal. One may think the EU could have some power to change things, but then again, they're also hugely into the whole "digital identity" thing.
You can find support for old firefox versions in the original repo.