I'm assuming the next step will be to bring Cloudflare's pet project of TPM attestation into Chrome, otherwise known as PATs[1]. And just like that, not only would headless browsing be defeated, but all of you using rooted devices and small-time browsers would be left high and dry.
What stops someone from making a fake TPM that speaks the appropriate protocol and just instantly signs off on every request? AFAIK there isn't some grand/central list of trusted TPM modules. Anyone can implement one as a Linux driver: https://www.kernel.org/doc/html/latest/security/tpm/tpm_ftpm...
A fake TPM would be useless for security but just fine for fooling websites that there is a real human at the computer.
> Computer programs can use a TPM to authenticate hardware devices, since each TPM chip has a unique and secret Endorsement Key (EK) burned in as it is produced.
That EK is signed by the TPM manufacturer, and so it’s likely they’ll only trust the keys of physical TPM manufacturers. Good luck forging that in software.
I wonder if we'll get a cat-and-mouse game with miscellaneous TPM manufacturers "accidentally" leaking their keys, getting blacklisted, creating new ones, etc. I'd like to think that there's at least a nontrivial amount of the population wanting to subvert the authoritarian corporatocracy and with the skills to do so.
It's going to be an extremely janky or very private website if they only allow you to use it when you have 1 of like a dozen supported and approved hardware TPMs to view it.
The latest Windows version requires a hardware TPM in order to be installed, so every hardware vendor now includes a TPM on all their new machines. This was already standard on Apple devices, and many Android devices have one as well.
Sure, but someone who wants to build a web scraper won't care; they could use their own homebrew TPM that does a no-op and claims a user pressed a button or was present when no one actually was.
I doubt websites will go to the trouble to keep a list of approved TPMs. It's the SSL root certs nightmare all over again and even worse. No one is going to want to deal with managing a whole new giant list of devices, having fire drill updates to revoke compromised ones, etc.
What is the solution to automation then? What do I do when someone hits my content-rich Wordpress blog with a scraper that hits 100 pages a second to download my content, and my database falls over leading to real, legitimate users being unable to use my site? What if it’s not a legitimate scraper but someone with hundreds of proxies uses them to DDOS my site for days? Should I sacrifice my uptime to protect the freedom of those unwilling to attest that they’re running on real hardware?
The method to stop a (D)DoS is the same as it always was: caching and rate limiting.
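For what rate limiting looks like in practice, here's a minimal sketch of the classic token-bucket algorithm in Python (the class and parameter names are my own, not from any particular framework):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows `rate` requests per second on
    average, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # request permitted
        return False      # request should be rejected or delayed
```

In a real deployment you'd keep one bucket per client IP (or per API key) and return HTTP 429 when `allow()` comes back false.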
Re: content scraping -- I was an indie web dev of a sort for a while and people always ask this question, and the answer is it's impossible to stop. Not even Facebook or big content sites like CNet or The Verge can stop it. At the bottom of it, you can just access the site in a browser and save the source. Content scraping is a rephrasing of "viewing content even just once". Stopping it is antithetical to the web and technologically infeasible.
it's probably actually cheaper to pay people piece rates to do it for you in a browser than to pay a developer to write and maintain a scraping script anyway, so if the latter became genuinely impossible, moving to the former isn't a big deal.
Put your WordPress blog behind a caching proxy with a 5s TTL - that way any amount of traffic to a URL will produce at most one hit every 5 seconds to your backend.
I've used this trick to survive surprise spikes of traffic in multiple projects for years.
Doesn't help for applications where your backend needs to be involved in serving every request, but WordPress blogs serving static content are a great example of something where that technique DOES work.
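The micro-caching idea above can be sketched as a small decorator; this is an illustration of the technique (the names `micro_cache` and `render` are hypothetical), not a substitute for a real caching proxy like Varnish or nginx:

```python
import time

def micro_cache(ttl=5.0):
    """Cache a function's result per argument tuple for `ttl` seconds, so
    any burst of identical requests hits the backend at most once per
    TTL window."""
    def decorator(fn):
        store = {}  # args -> (expiry_time, cached_value)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]          # serve from cache, no backend hit
            value = fn(*args)          # the one backend hit this window
            store[args] = (now + ttl, value)
            return value
        return wrapper
    return decorator

@micro_cache(ttl=5.0)
def render(path):
    # Stand-in for the expensive WordPress/database render.
    return f"<html>{path}</html>"
```

Even a 5-second TTL collapses a 100-requests-per-second spike into at most one backend render every 5 seconds per URL, which is the whole trick.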
Proof-of-work schemes such as Hashcash[1] and simple ratelimiting algorithms can act as deterrents to spamming and scraping attacks.
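The hashcash idea is simple enough to sketch: the client must find a nonce whose hash has a given number of leading zero bits, which costs it real CPU time, while the server verifies with a single hash. A minimal illustration (not the actual Hashcash stamp format, which also encodes a date and version):

```python
import hashlib
from itertools import count

def mint(resource, bits=16):
    """Find a nonce such that sha256(resource:nonce) has `bits` leading
    zero bits. Expected cost: ~2**bits hashes for the client."""
    target = 1 << (256 - bits)
    for nonce in count():
        digest = hashlib.sha256(f"{resource}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(resource, nonce, bits=16):
    """Check a claimed solution with a single hash."""
    digest = hashlib.sha256(f"{resource}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))
```

Tuning `bits` sets the asymmetry: a human loading one page barely notices the delay, but a scraper issuing 100 requests a second pays it 100 times over.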
There are other kinds of non-invasive bot management you can do as well; however, for various reasons I'm not in a position to talk about them. A few other methods are mentioned at the end of the post being discussed[2].
Can privacy be preserved with zero knowledge proofs? I don't like the idea of universal fingerprinted devices in an already heavily authoritarian world.
Semantic quibble: it's less "proof of work" and more "proof of hardware+work". Or, as they call it, hardware-bound proof of work. The reason you can't offload the challenge to a more powerful device is that they rely on identifying stable differences for each device class that ultimately trace down to the hardware they're running on.
Wasn't mining in the browser basically shut down by every major browser?
It was done super fast... one can't help but think that Google pulled all the levers they had at Apple/Mozilla to make sure the first viable alternative to advertisement was killed before it was born. But as a side effect, I think it might make PoW sort of impossible?
I don't really know how the mining "fingerprinting" works exactly, so I'd be curious to know if I'm wrong.
What killed "mining in the browser", more than anything else, was:
1) It was almost exclusively used for malicious purposes. Very few legitimate web sites used cryptominers, and it was never considered a viable substitute for display advertising; it was primarily deployed on hacked web sites. Browser vendors were relatively slow to react; many of the first movers were actually antivirus/antimalware vendors adding blocks on cryptominer scripts and domains.
2) The most popular cryptominer scripts, like Coinhive, all mined the Monero coin. (Most other cryptocurrencies were impractical to mine without hardware acceleration.) Monero prices were at an all-time high at the time; when Monero prices crashed in late 2018, the revenue from running cryptominer scripts dropped dramatically, making these scripts much less profitable to run. (This is ultimately what led Coinhive to shut down.)
I guess slow/fast is subjective. It didn't seem like enough time passed for a legitimate ecosystem to develop. Just the basic idea of, say, hosting a static site/blog on a VPS with a cryptominer that could pay for itself would have been a game changer, and that was probably just the tip of the iceberg of possibilities. Instead we're still stuck either having to sell our traffic/info to Google/Microsoft, put up ads, or pay for it out of pocket. The entrenched players won.
The hacked-site bogeyman felt overblown (and from what you're saying, it sounds like it would have died out anyway). I'm sure it happened, but at least personally I never once came across it. Or if I did, then my CPU spun a bit more and I didn't notice. No real harm done.
More fundamentally we're now in territory where the browser vendors get to decide what javascript is okay to run and which isn't.
Anyway, it's just complaining into the ether :) it is what it is. thanks for the context of the market forces and antivirus companies
> I guess slow/fast is subjective. It didn't seem like enough time passed for a legitimate ecosystem to develop.
Coinhive was live from 2017 to 2019, and it basically ran the whole course from exciting new tech to widely abused to dead over those two years. I don't think it needed more time.
> The hacked site boogieman felt overblown...
Troy Hunt acquired several of the Coinhive domains in 2021 -- two years after the service shut down -- and it was still getting hundreds of thousands of requests a day, mostly from compromised web sites and/or infected routers. It was a serious problem, albeit one which mostly affected smaller and poorly maintained web sites.
Make it someone else's problem; put a caching CDN in front of it, like Cloudflare, who have experience with these problems (like intentional or accidental DDOS).
I understand and agree with the suggestion of putting a CDN, but it's somewhat ironic to suggest the use of Cloudflare when that very same company is advocating for the DRM-for-webpages scheme.
Is it not fair to assume that Cloudflare, as a company that has made a name for itself selling various DDoS protection services, realizes it's in an arms race with the old-school way of handling these problems and is pursuing more advanced solutions before the current techniques become entirely useless?
It would be easy to point to the irony of saying "instead of supporting Cloudflare's proposals for PATs, use their CDN product for brute force protection" but on the other hand, they employ a lot of experts in this space and might see the writing on the wall in an increasingly adversarial public internet.
This is a good question, but if you look at it closely, Cloudflare seems to be the only company advocating for attestation schemes for the web.
It’s almost as if the conspiracy theory of Cloudflare acting as an arm of the US government and helping in the centralization of the internet is actually true.
is there such a thing as a caching CDN that effectively protects against scrapers? generally, if somebody is going to scrape a whole bunch of old, infrequently-accessed but dynamically generated pages, most of those won't be in the cache, so the caching proxy isn't going to help at all.
i'm honestly asking, not just trying to disprove you. this is a real problem i have right now. ideally i'd get all my thousands of old, never-updated but dynamically generated pages moved over to some static host, but that's work and if i could just put some proxy in front to solve this for me i'd be pretty happy. but afaik, nothing actually solves this.
Akamai has a scraper filter (I think it just rate limits scrapers out of the box but can be configured to block if you want).
I'm not sure how good it is at detecting what is a scraper and what isn't though.
Yeah, AWS has one of these, a set of firewall rules called "bot control". it seems to work well enough for blocking the well-behaved bots that request pages at a reasonable rate and self-identify with user-agent strings (which i'm not really concerned about blocking, but it does give me some nice graphs about their traffic). it doesn't seem to do a whole lot to block an unknown scraper hitting pages as fast as it can.
> What do I do when someone hits my content-rich Wordpress blog with a scraper that hits 100 pages a second to download my content, and my database falls over
It's a blog. Blogs are not complex. Why is your blog's database so awfully designed that 100 pages a second causes it to fall over?
> leading to real, legitimate users being unable to use my site?
You assume that a scraper is not a legitimate user. I argue otherwise. If you don't want a scraper to use your site then put your site behind a paywall.
> What if it’s not a legitimate scraper but someone with hundreds of proxies uses them to DDOS my site for days?
If it's a network bandwidth problem, then a reverse proxy (eg, CDN) solves that.
> Should I sacrifice my uptime to protect the freedom of those unwilling to attest that they’re running on real hardware?
All software runs on real hardware. What is your exact question?
I am accessing this site in a virtual machine. I could be doing it with a headless browser. Why does that matter at all?
It's "Right to read"[2] all over again.
[1] https://www.ietf.org/archive/id/draft-private-access-tokens-...
[2] https://www.gnu.org/philosophy/right-to-read.en.html