Hacker News

I don't think this is what most people mean when they say deduplication. Quite a few systems just scan for duplicates and then automatically delete one of the copies. In such a system SHA-1 would be inappropriate.

If you are just using SHA-1 as a heuristic you don't fully trust, I suppose SHA-1 is fine. It seems a bit of an odd choice though, as something like MurmurHash would be much faster for such a use case.
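To make the "heuristic you don't fully trust" idea concrete, here is a rough sketch in Python. The hash only shortlists candidates, and a full byte-for-byte comparison confirms them before anything is treated as a duplicate. The names `quick_hash` and `find_duplicates` are made up for illustration. SHA-1 is used only because it ships in the standard library; a faster non-cryptographic hash such as MurmurHash (third-party package, not shown) could be swapped in, since the hash is never trusted on its own.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def quick_hash(path, chunk_size=1 << 20):
    """Hash a file in chunks. Used only to shortlist candidates."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[(path.stat().st_size, quick_hash(path))].append(path)

    groups = []
    for candidates in by_hash.values():
        # A matching hash is only a hint: confirm with a full comparison.
        remaining = candidates
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            same = [first] + [
                p for p in rest if filecmp.cmp(first, p, shallow=False)
            ]
            if len(same) > 1:
                groups.append(same)
            remaining = [p for p in rest if p not in same]
    return groups
```

Because of the final comparison, a hash collision costs an extra read at worst and never a wrongly deleted file. A system that deletes on hash match alone has no such safety net.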



Most people. I haven't been part of that group, for like, ever maybe?

If we're a group of devs with a not insignificant percentage being frontend/UI/UX types, then having the same image in multiple sizes, formats, etc. is going to be pretty common. Looking for multiples of the exact file is only going to reduce so much. If you know your library holds a source image plus all of its derivatives, you can get down to a lot fewer files, and as long as you know you have the source, running image-based similarity matching is much more beneficial. Sure, this is niche territory, but yeah, and, so?

Maybe there's someone new(-ish) that hasn't really had to deal with cleaning up thousands of images to this extent. One would hope the same image in its various forms within a dev's env would be similarly named, but that's not guaranteed. If we could depend on filenames, we wouldn't need hashing, right?


In some cases deduplication happens at the file system layer transparently without you even realizing it. E.g. there are tools like https://github.com/lakshmipathi/dduper

I agree that image editing workflows are a different use case more suited to perceptual hashes than cryptographic hashes.
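For anyone who hasn't run into perceptual hashes, here is a minimal sketch of one of the simplest variants, the average hash, in plain Python. It assumes the image has already been decoded to grayscale and is passed in as a list of rows of 0-255 integers; decoding real files would need an imaging library, which is left out here. The function names and the 8x8 hash size are illustrative choices, not taken from any particular tool.

```python
def average_hash(pixels, hash_size=8):
    """Average hash of a grayscale image (list of rows of 0-255 ints)."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Shrink the image to hash_size x hash_size by averaging blocks.
    for row in range(hash_size):
        y0 = row * height // hash_size
        y1 = max((row + 1) * height // hash_size, y0 + 1)
        for col in range(hash_size):
            x0 = col * width // hash_size
            x1 = max((col + 1) * width // hash_size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    # One bit per cell: is it brighter than the overall mean?
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (value > mean)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Unlike a cryptographic hash, similar inputs are supposed to give similar outputs here: a resized or recompressed copy should land within a small Hamming distance of its source, while unrelated images should land far apart. What counts as "small" is a threshold you would have to tune on your own images.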


SHA-1 is untrustworthy only when there is a possibility that both of the files whose hashes are compared were created by an attacker.

Such a scenario may be plausible for a public file repository, so SHA-1 is a bad choice for a version control system like Git, but there are a lot of applications where it is impossible, and there it is fine to use SHA-1.


I'm not sure what scenarios there are where an attacker could create one file but not both. Especially because the attacker doesn't need to fully control both files: controlling only a prefix of one of them could be enough to carry out the attack.

I also think working out all the possibilities is really hard, and using SHA-256 is really easy.
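As a rough illustration of how small the change is, here is a sketch of hashing a file with SHA-256 using Python's standard `hashlib`. The helper name `sha256_file` and the 1 MiB chunk size are arbitrary choices for the example.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

In code like this, moving from SHA-1 comes down to changing the constructor name; the digest just gets longer (64 hex characters instead of 40).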



