One thing I didn't get is this magical PII thing. How does the author look at a random network packet -- nay, just packet headers -- and assign a PII:true/false label? I think many corporations would sacrifice the right hand of a sysadmin if that was the way to get this tech.
The article just says:
> I wrote a small python program to scan the port 80 traffic capture and create a mapping from each four-tuple TLS connection to a boolean - True for connection with PII and False for all others.
Is it just matching against a list of source IPs? And perhaps the source port, to determine whether it comes from e.g. a network drive (NFS in this case)? Not sure what he uses the full four-tuple for, if this is the answer in the first place. It's very hand-wavy for what is an integral part of finding the intrusion and kind of a holy grail in other situations as well.
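If it really is just source-IP matching, the "small python program" could be as trivial as the sketch below. This is entirely my speculation — the PII-source list, addresses, and function name are all made up, not anything from the article:

```python
# Speculative reconstruction of the labelling script described in the
# article. The PII-source set and all addresses here are invented.
PII_SOURCES = {"10.0.0.5"}  # e.g. the address of the NFS/file server

def label_connections(four_tuples):
    """Map each (src_ip, src_port, dst_ip, dst_port) four-tuple to
    True if the source is a known PII-serving host, else False."""
    return {ft: ft[0] in PII_SOURCES for ft in four_tuples}

connections = [
    ("10.0.0.5", 2049, "10.0.1.20", 44321),  # from the file server
    ("10.0.2.9", 443, "10.0.1.20", 51812),   # unrelated traffic
]
labels = label_connections(connections)
# first connection labelled True, second False
```

Which would make the full four-tuple nothing more than a convenient dictionary key for joining against the other capture — but that's a guess, not something the article states.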
Amazon and Microsoft also have their own offerings, but those can be quite expensive for network packets (and pretty slow).
Most projects/teams will use some basic regular expressions to capture the basics like SSNs, credit card numbers, or phone numbers. Those are typically just strings of a specific length. It gets more difficult if you're doing addresses, names, etc.
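For anyone who hasn't seen them, "basic regular expressions" here really means patterns as simple as the following. These are illustrative toys, not what any particular product ships — real tools layer checksums and context keywords on top:

```python
import re

# Toy versions of the patterns DLP-style tools use; purely illustrative.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CREDIT_CARD = re.compile(r"\b(?:\d[-. ]?){15}\d\b")
US_PHONE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")

sample = "Card 4111 1111 1111 1111, SSN 123-45-6789, call (555) 867-5309."
found = {name: bool(rx.search(sample))
         for name, rx in [("ssn", SSN), ("card", CREDIT_CARD), ("phone", US_PHONE)]}
# all three patterns fire on this sample
```

Which also illustrates the length-based nature of the matching: nothing here knows what a credit card *is*, only what sixteen digits with optional separators look like.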
That's great for you but... how's that relevant to the article? The author never speaks of using this sort of thing.
I saw these regex matchers in school but never understood the appeal. They go off all day long because roughly one in ten random numbers passes the checksum for a valid credit card number; even in the lab environment, the default setup was clearly unusable.

But perhaps more to my point: who'd ever upload the stolen data in plaintext anyhow? Unencrypted connections have not been the default for stolen data since... the 80s? If your developers are allowed to do rsync/scp/ftps/sftp/https-post/https-git-smart-protocol then so can I, and if they can't do any of the above then they can't do their work. Adding a MITM proxy is, aside from being a SPOF waiting to happen, also very easily circumvented: you'd have to reject anything that looks high in entropy (so much for git clone and sending PDFs), and adding a few null bytes to defeat that trigger is also peanuts.
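Both failure modes are easy to demonstrate. The sketch below (my own, nothing from the article) shows that the standard Luhn credit-card checksum passes for about 10% of random digit strings, and that naive null-byte padding drags the Shannon entropy of ciphertext-like data well below the "looks encrypted" range:

```python
import math
import random

def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum; passes for ~1 in 10 random digit strings."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 1:      # double every second digit from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0

random.seed(0)
hits = sum(luhn_ok("".join(random.choices("0123456789", k=16)))
           for _ in range(10_000))
# hits / 10_000 lands near 0.10 - constant false alarms in any traffic
# that happens to carry long digit strings.

def shannon_entropy(data: bytes) -> float:
    """Bits per byte; close to 8.0 for encrypted/compressed data."""
    if not data:
        return 0.0
    return -sum(data.count(b) / len(data) * math.log2(data.count(b) / len(data))
                for b in set(data))

blob = bytes(random.randrange(256) for _ in range(4096))  # stand-in for ciphertext
padded = blob + b"\x00" * 4096  # trivially dilute the entropy signal
# shannon_entropy(blob) is near 8; shannon_entropy(padded) drops by multiple bits
```

So an entropy threshold strict enough to catch padded exfiltration would also reject every compressed PDF and git pack on the wire.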
These appliances are snake oil as far as I've seen. But then I very rarely see our customers use this sort of stuff, and when I do, it's usually trivial to circumvent (as I invariably have to in order to do my work).
Now, the repository you linked doesn't use regexes; it uses "a cutting edge pre-trained deep learning model, used to efficiently identify sensitive data". Cool. But I don't see any stats from real-world traffic, and I also don't see anyone bolting custom Python code onto their MITM box to run this against gigabits of traffic. Is this a product that is relevant here, or more of a tech demo that works on example files and could theoretically be adapted? Either way, since it's irrelevant to what the author did, I'm not even sure if this is just spam.
> One thing I didn't get is this magical PII thing. How does the author look at a random network packet -- nay, just packet headers -- and assign a PII:true/false label? I think many corporations would sacrifice the right hand of a sysadmin if that was the way to get this tech.
Check out Amazon Macie or Microsoft Presidio, or try actually using the library I linked?
It's usually used in a constrained way and is in no way perfect. But it helps investigators track suspected cases of data exfiltration: you can pull something that looks suspect (say, a credit card number) and compare it against an internal dataset to see if it's legit.
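That confirmation step is roughly the following (a sketch with made-up names; a real workflow would obviously query a proper records system, not an in-memory set):

```python
import re

# Hypothetical stand-in for the company's own card records.
INTERNAL_CARDS = {"4111111111111111"}

def suspect_cards(text: str) -> set[str]:
    """Extract candidate 16-digit card numbers, normalising separators."""
    return {m.replace(" ", "").replace("-", "")
            for m in re.findall(r"\b(?:\d[ -]?){15}\d\b", text)}

capture = "order ref 4111 1111 1111 1111 shipped"
confirmed = suspect_cards(capture) & INTERNAL_CARDS
# a non-empty intersection means a real record left the network,
# not just a random Luhn-passing digit string
```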
In the repo I linked you can see the citation for an earlier model evaluated on synthetic and real-world datasets:
My guess was that traffic containing PII was flagged in some way such that it was visible in the pre-GW traffic the researcher had access to. That was the point of linking up the pre-gateway and post-gateway packets. I'm not sure how common such setups are.
What's even more incredible to me is that the researcher somehow recreated exactly the same (i.e. correct) traffic pattern in their local testing setup, so that they were able to compare it with the production environment and detect that there was a problem. How would you do this?
I'm not even sure what the "time" variable is on the graphs. Response time? (It also seems weird that there's any PII on port 80, but that's an unrelated issue.)
> What's even more incredible to me is that the researcher somehow recreated exactly the same / correct traffic pattern on their local testing setup, so that they were able to compare the traffic with the production environment to detect that there was a problem.
Yeah, that's another thing that has me confused, but I figured one thing at a time...
Thanks for the response, that pre-set PII flag does sound plausible, though it's odd that they'd never mention it and mention a 'four-tuple' instead (sounds like they're trying to use terms not everyone knows? Idk, maybe it's more well-known than it seems to me).
Yes, that was the part where I got lost. It seems he skipped some details about that so it's not clear from the article how that was done. I can't imagine capturing the encrypted data got him that.