
The game continues. Back in 2010 when I was writing the first in-browser bot detection signals for Google (so BotGuard could spot embedded Internet Explorers) I wondered how long they might last. Surely at some point embedded browsers would become undetectable? It never happened - browsers are so complex that there will probably always be ways to detect when they're being automated.

There are some less obvious aspects to this that matter a lot in practice:

1. You have to force the code to actually run inside a real browser in the first place, not simply inside a fast emulator that sends back a clean response. This is by itself a big part of the challenge.

2. Doing so is useful even if you miss some automated browsers, because adversaries are often CPU and RAM constrained in ways you may not expect.

3. You have to do something sensible if the User-Agent claims to be something obscure, old or, alternatively, too new for you to have seen before.

4. The signals have to be well protected, otherwise bot authors will just read your JS to see what they have to patch next. Signal collection and obfuscation work best when the two are tightly integrated together.

These days there are quite a few companies doing JS based bot detection but I noticed from write-ups by reverse engineers that they don't seem to be obfuscating what they're doing as well as they could. It's like they heard that a custom VM is a good form of obfuscation but missed some of the reasons why. I wrote a bit about why the pattern is actually useful a month ago when TikTok's bot detector was being blogged about:

https://www.reddit.com/r/programming/comments/10755l2/revers...

tl;dr you want to use a mesh oriented obfuscation and a custom VM makes that easier. It's a means, not an end.

Ad: Occasionally I do private consulting on this topic, mostly for tech firms. Bot detectors tend to be either something home-grown by tech/social networking firms, or these days sold as a service by companies like DataDome, HUMAN etc. Companies that want to own their anti-abuse stack have to start from scratch every time, and often end up with something subpar because it's very difficult to hire for this set of skills. You often end up hiring people with a generic ML background but then they struggle to obtain good enough signals and the model produces noise. You do want some ML in the mix (or just statistics) to establish a base level of protection and to ensure that when bots are caught their resources are burned, but it's not enough by itself anymore. I offer training courses on how to construct high quality JS anti-bot systems and am thinking of maybe in future offering a reference codebase you can license and then fork. If anyone reading this is interested, drop me an email: mike@plan99.net



What are bots used for? I can think of a few reasons; I wrote a scraper/submitter myself in the '90s for a cooperative of subcontractors that was being forced to use an extremely sluggish web app by the big company that provided their gigs.

But I guess there are all kinds of purposes, some benign, some nefarious, and that the purpose somehow influences how the bots operate and how they get detected.


People are paying $500 for bots used to buy the latest Nike/Adidas/... limited edition sneakers. Or video cards a few years ago (for crypto mining).

It's a whole industry.

> If we consider a user base of ~175 users, and a minimum bot price of 200 euros (175 users x 200 euros), then the bot developers made at least 35K euros (~$37K USD) in initial bot sales.

https://datadome.co/threat-research/inside-sneaker-bot-busin...


Artificial scarcity in sneakers is their design decision. These shenanigans should have zero impact on browser policy.


I thought about building something like that for photographers to get gigs from large real-estate photography contractors who sub-contract the work to independent photographers. Automated tools would benefit the photographers greatly. The benefit comes at the expense of those not using automated tools, so the morality of such a tool is at least somewhat questionable.


"The signals have to be well protected, otherwise bot authors will just read your JS to see what they have to patch next. Signal collection and obfuscation work best when the two are tightly integrated together."

JS sounds like a bad match for this task. I perform similar checks from the backend with http headers and Python.

Is there a compelling reason to stick with JS despite the added complexity of obfuscation?

Edit: My use case is different than yours as it's part of a pid-free analytics application. However, bot detection is still an important component of that product.


If you're only relying on http headers, you're missing all but the most trivial of "bots". There are other things you could do with a backend-only approach but if your code doesn't run where the device connects to (e.g. you're behind a load balancer or other reverse proxy), those are largely unworkable.


"If you're only relying on http headers, you're missing all but the most trivial of bots"

Very true. Capturing, processing, and storing analytics data long-term is expensive. If I eliminate even 50% of that noise, the savings will be worth it.

I'm attempting to identify the bulk of bots with http headers and real-time session monitoring. I also have an unauthorized list (known bad actors) and an ignore list (search bots, etc.). It works pretty well but definitely doesn't begin to address the problem as a whole (from a security perspective).

It's an interesting and complex topic.
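
For illustration, the header checks plus unauthorized and ignore lists described above could look roughly like the sketch below. Everything named here is an assumption made for the example, not the commenter's code: the list contents, the `classify` function, its labels, and the specific heuristics (a missing User-Agent, a scripting-client User-Agent, or a missing Accept-Language header).

```python
# Hypothetical lists; a real deployment would load and maintain these elsewhere.
DENY_IPS = {"203.0.113.7"}                       # "unauthorized list" (documentation-range IP)
IGNORE_UA_TOKENS = ("googlebot", "bingbot")      # "ignore list": declared search crawlers
SCRIPT_UA_TOKENS = ("python-requests", "curl/", "wget/")  # common scripting clients


def classify(headers: dict[str, str], remote_ip: str) -> str:
    """Return 'deny', 'ignore', 'likely_bot' or 'likely_human' from headers alone."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    user_agent = h.get("user-agent", "").lower()

    if remote_ip in DENY_IPS:
        return "deny"
    if any(token in user_agent for token in IGNORE_UA_TOKENS):
        return "ignore"
    # Assumed heuristic: no User-Agent, or one that names a scripting client.
    if not user_agent or any(token in user_agent for token in SCRIPT_UA_TOKENS):
        return "likely_bot"
    # Assumed heuristic: mainstream browsers normally send Accept-Language.
    if "accept-language" not in h:
        return "likely_bot"
    return "likely_human"
```

As the replies above point out, checks like these only filter the most trivial clients, since every header can be forged; the sketch is about cutting noise, not about security.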


Re: your ad.

This sounds like a solid product / startup idea to me. I worked on spambot detection in a previous job and it's not at all trivial to solve. Though we were specifically interested in detecting the abusive use of bots, not bots in general, so I focused simply on detecting unusual resource consumption rather than fingerprinting.
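
One simple way to approximate "unusual resource consumption" is a per-client sliding-window counter. This is only a sketch under assumptions: the `UsageMonitor` class, its parameters and the thresholds are hypothetical, and the comment above does not say how the actual system measured consumption.

```python
from collections import defaultdict, deque


class UsageMonitor:
    """Flag clients that make more than `max_requests` within `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0, max_requests: int = 100):
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self._events = defaultdict(deque)  # client id -> timestamps inside the window

    def record(self, client_id: str, now: float) -> bool:
        """Record one request at time `now`; return True if the client is over the limit."""
        events = self._events[client_id]
        events.append(now)
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= now - self.window_seconds:
            events.popleft()
        return len(events) > self.max_requests


monitor = UsageMonitor(window_seconds=10, max_requests=3)
flags = [monitor.record("client-a", t) for t in (0, 1, 2, 3, 20)]
```

Timestamps are passed in explicitly so the behaviour is easy to test; in a live service `now` would typically come from `time.monotonic()`.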


There are startups doing this sort of thing already; the article is written by the head of research at one. But tech firms often like to have their own in-house stack with the source code.


Yeah, but non-tech companies, like Nike/Adidas as mentioned in other comments, will need this kind of bot-detection service.



What do you mean by a "mesh-oriented obfuscation"? My best guess is: serving a different subset of the VM detection code to each client?


There are lots of techniques that can fall under that heading. The idea is to tie together your logic and obfuscation so that the things you have to do to undo the obfuscation end up breaking access to other parts of the program. Using the output of hash functions as decryption keys is one famous approach but there are others.
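
A toy illustration of the hash-as-decryption-key idea, written in Python for readability rather than as browser code. It is one possible reading of the comment, not the author's implementation: hashing the stage-1 code itself is an assumption, the stage contents are placeholder strings, and the XOR keystream is a stand-in for a real cipher.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Expand `key` into `length` bytes using SHA-256 in counter mode."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR `data` with the keystream for `key`; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


# Placeholder for signal-collection code shipped to the client (hypothetical names).
STAGE1 = b"collect_signals(); check_webdriver_flag();"
# Stage 2 can only be recovered with a key derived from stage 1's exact bytes.
STAGE2_PLAIN = b"compute_token(signals);"
STAGE2_ENCRYPTED = xor_bytes(STAGE2_PLAIN, hashlib.sha256(STAGE1).digest())


def unlock_stage2(stage1_as_executed: bytes) -> bytes:
    """Derive the key from the stage-1 bytes that actually ran, then decrypt stage 2."""
    key = hashlib.sha256(stage1_as_executed).digest()
    return xor_bytes(STAGE2_ENCRYPTED, key)


untouched = unlock_stage2(STAGE1)
patched = unlock_stage2(STAGE1.replace(b"check_webdriver_flag();", b""))
```

With the original stage 1, `untouched` equals `STAGE2_PLAIN`; once a check is patched out, the derived key changes and `patched` comes back as garbage, which is the "undoing one part breaks access to another" property in miniature.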


Heh, I had a feeling you'd show up here. Hi, Mike :)


Long time no see mate :)



