That's great for you but... how's that relevant to the article? The author never speaks of using this sort of thing.
I saw these regex matchers in school but don't understand them. They go off all day long because one in a dozen numbers matches a valid credit card number; even in the lab environment, the default setup was clearly unusable. But perhaps more to my point: who'd ever upload the stolen data in plaintext anyhow? Unencrypted connections have not been the default for stolen data since... the 80s? If your developers are allowed to do rsync/scp/ftps/sftp/https-post/https-git-smart-protocol then so can I, and if they can't do any of the above then they can't do their work. Adding a mitm proxy is, aside from being a SPOF waiting to happen, also very easily circumvented. You'd have to reject anything that looks high in entropy (so much for git clone and sending PDFs), and padding with null bytes to avoid that trigger is also peanuts.
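To put rough numbers on both complaints, here is a minimal sketch. It assumes the matcher boils down to "a run of digits that passes the Luhn checksum" and that the entropy trigger is a plain Shannon-entropy threshold over the whole payload. Both are my assumptions; real appliances may add issuer prefixes, keyword context, or sliding windows. All function names are made up for illustration.

```python
import math
import random
from collections import Counter


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def luhn_pass_rate(samples: int = 20000, length: int = 16, seed: int = 0) -> float:
    """Fraction of random digit strings that pass Luhn (hypothetical helper)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        candidate = "".join(rng.choice("0123456789") for _ in range(length))
        hits += luhn_valid(candidate)
    return hits / samples


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


if __name__ == "__main__":
    # Random 16-digit strings pass Luhn about one time in ten.
    print("Luhn pass rate:", luhn_pass_rate())

    rng = random.Random(0)
    payload = bytes(rng.getrandbits(8) for _ in range(4096))  # stand-in for ciphertext
    padded = payload + b"\x00" * (3 * len(payload))           # 3x null padding
    print("entropy raw:   ", shannon_entropy(payload))
    print("entropy padded:", shannon_entropy(padded))
```

The checksum alone lets roughly one in ten random digit strings through, so any traffic with long numeric IDs will trip it. How much padding the entropy trick needs depends entirely on the detector's threshold and window size, which I'm guessing at here.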
These appliances are snake oil as far as I've seen. But then I very rarely see our customers use this sort of stuff, and when I do it's usually trivial to circumvent (as I invariably have to in order to do my work).
Now the repository you linked doesn't use regexes, it uses "a cutting edge pre-trained deep learning model, used to efficiently identify sensitive data". Cool. But I don't see any stats from real-world traffic, and I also don't see anyone adding custom Python code onto their mitm box to match this against gigabits of traffic. Is this a product that is relevant here, or more of a tech demo that works on example files and could theoretically be adapted? Either way, since it's irrelevant to what the author did, I'm not even sure if this is just spam.
> One thing I didn't get is this magical PII thing. How does the author look at a random network packet -- nay, just packet headers -- and assign a PII:true/false label? I think many corporations would sacrifice the right hand of a sysadmin if that was the way to get this tech.
Check out Amazon Macie or Microsoft Presidio, or try actually using the library I linked?
It's usually used in a constrained way and is in no way perfect. But it helps investigators track suspected cases of data exfiltration. You can pull something that looks suspect (say, a credit card number) and compare it against an internal dataset to see if it's legit.
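A rough sketch of that "compare against an internal dataset" step, to make it concrete. Everything here is hypothetical: the regex, the key, and the function names are mine, and the "internal dataset" is just a set of keyed fingerprints built from a well-known test card number. It is not how Macie, Presidio, or the linked library work internally.

```python
import hashlib
import hmac
import re

# Hypothetical key; in practice this would live in a secrets store.
FINGERPRINT_KEY = b"example-key-not-for-production"

# 13 to 19 digits, optionally separated by single spaces or dashes.
CARD_CANDIDATE_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def fingerprint(number: str) -> str:
    """Keyed hash so the lookup set never stores raw card numbers."""
    return hmac.new(FINGERPRINT_KEY, number.encode(), hashlib.sha256).hexdigest()


def find_known_cards(text: str, known_fingerprints: set[str]) -> list[str]:
    """Return candidate numbers in `text` that match the internal dataset."""
    matches = []
    for m in CARD_CANDIDATE_RE.finditer(text):
        number = re.sub(r"[ -]", "", m.group())  # strip separators
        if fingerprint(number) in known_fingerprints:
            matches.append(number)
    return matches


if __name__ == "__main__":
    # Stand-in for the internal dataset: one widely used test card number.
    known = {fingerprint("4111111111111111")}
    sample = "order 4111 1111 1111 1111 shipped, ref 1234567890123456"
    print(find_known_cards(sample, known))
```

The point of the lookup is that a random 16-digit reference number no longer raises an alert; only numbers that actually exist in your own data do.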
In the repo I linked you can see the citation for an earlier model on synthetic and real world datasets: