I don't think this is what most people think of when they say deduplication. Ther...

dylan604 · on Dec 16, 2022

Most people. I haven't been part of that group, for like, ever maybe?

If we're a group of devs with a not insignificant percentage of those devs being frontend/UI/UX types, then having the same image in multiple sizes, formats, etc. is going to be pretty common. Looking for multiples of the exact file is only going to reduce so much. If you know your library holds a source image plus all of its derivatives, then as long as you keep the source, running image-based sameness checks is much more beneficial and leaves you with far fewer files. Sure, this is niche territory, but yeah, and, so?

Maybe there's someone new(-ish) that hasn't really had to deal with cleaning up thousands of images to this extent. One would hope the same image in its various forms within a dev's env would be similarly named, but that's not guaranteed. If we could depend on filenames, we wouldn't need hashing, right?

bawolff · on Dec 16, 2022

In some cases deduplication happens at the file system layer transparently without you even realizing it. E.g. there are tools like https://github.com/lakshmipathi/dduper

I agree that image editing workflows are a different use case more suited to perceptual hashes than cryptographic hashes.
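To make the perceptual vs. cryptographic distinction concrete, here is a minimal sketch of one simple perceptual hash (an "average hash"). It is only an illustration, not what any tool in this thread does: the function names, the 8x8 grid, and the choice to take an already decoded grayscale pixel grid as input are all assumptions made to keep it dependency-free.

```python
def average_hash(pixels, hash_size=8):
    """Return a hash_size*hash_size-bit average hash of a grayscale image.

    `pixels` is a 2D list of brightness values (rows of ints). Decoding a
    real image file into this form is assumed to happen elsewhere.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(hash_size):
        y0 = row * height // hash_size
        y1 = max((row + 1) * height // hash_size, y0 + 1)
        for col in range(hash_size):
            x0 = col * width // hash_size
            x1 = max((col + 1) * width // hash_size, x0 + 1)
            # Shrink the image by averaging each block of pixels.
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: is it brighter than the overall mean?
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# A 16x16 horizontal gradient, a 2x upscaled copy, and an inverted copy.
gradient = [[x * 8 for x in range(16)] for _ in range(16)]
upscaled = [[gradient[y // 2][x // 2] for x in range(32)] for y in range(32)]
inverted = [[255 - p for p in row] for row in gradient]

same = hamming_distance(average_hash(gradient), average_hash(upscaled))      # 0
different = hamming_distance(average_hash(gradient), average_hash(inverted))  # 64
```

The resized copy hashes identically even though its bytes differ, which is exactly what a cryptographic hash would never report. With real images you would compare the distance against a small threshold rather than test for equality.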

adrian_b · on Dec 15, 2022

SHA-1 is untrustworthy only when there is a possibility that both of the files whose hashes are being compared were created by an attacker.

While such a scenario may be plausible for a public file repository, which makes SHA-1 a bad choice for a version control system like Git, there are a lot of applications where it is impossible, so it is fine to use SHA-1 there.

bawolff · on Dec 15, 2022

I'm not sure what scenarios there are where the attacker could create one file but not both. Especially because the attacker doesn't need to fully control both files but could control only a prefix of one of them and still do the attack.

I also think working out all the possibilities is really hard, and using sha256 is really easy.
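For what it's worth, here is roughly how little code the "just use sha256" route takes for exact-duplicate detection. This is a sketch under my own assumptions: the helper names `sha256_of_file` and `find_exact_duplicates` are made up for illustration, and it only finds byte-identical files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return lists of paths under `root` whose contents are byte-identical."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    # Only digests shared by more than one file are duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("some/dir")` gives one list per set of identical files. Swapping in `hashlib.sha1` would be a one-word change, which is sort of the point: there is no real cost to picking the stronger hash.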