SHA-1 is still perfectly fine for some applications like detecting duplicate files on a storage medium (and it's less likely to produce a false positive than MD5) but it's been a bad idea for anything security related for a decade.
The biggest issue is that git still uses it, which presents a problem if you want to protect a repo from active integrity attacks.
Git no longer uses SHA-1. It instead uses a variant called SHA-1DC that detects some known problems, and in those cases returns a different answer. More info: <https://github.com/cr-marcstevens/sha1collisiondetection>. Git switched to SHA-1DC in its version 2.13 release in 2017. It's a decent stopgap but not a great long-term solution.
The fundamental problem is that git developers assumed that hash algorithms would never need to be changed, and that was a ridiculous assumption. It's much wiser to implement crypto agility.
> The fundamental problem is that git developers assumed that hash algorithms would never need to be changed, and that was a ridiculous assumption. It's much wiser to implement crypto agility.
Cryptographic agility makes this problem worse, not better: instead of having a "flag day" (or release) where `git`'s digest choice reflects the State of the Art, agility ensures that every future version of `git` can be downgraded to a broken digest.
That's the general anti-agility argument wielded against git, but note that git's use cases require it to process historic data.
E.g. you will want to be able to read some sha-1-only repo from disk that was last touched a decade ago. That's a different thing from a protocol that requires both parties to be online, say WireGuard, where it's easier to switch both sides to a new version that uses a different cryptographic algorithm.
Git has such protocols as well, and maybe it can deprecate sha-1 support there eventually, but even there it has to support both sha-1 and sha-2 for a while because not everyone is using the latest and greatest version of git, and no sysadmin wants the absolute horror of flag days.
Assuming reasonable logic around hashes, like "a SHA-2 commit can't be a parent of a SHA-1 commit", there wouldn't be much in the way of downgrade attacks available.
Wow, smart! This would keep all the old history intact and at the same time force lots of people to upgrade through social pressure. I'd probably be angry as hell when that happened to me, but it would also work.
FTR the current plan for git's migration is that commits have both SHA-1 and SHA-2 addresses, and you can reference them by both. There is thus no concept of a "SHA-2 commit" or a "SHA-1 commit". The issue is more around pointers that are not directly managed by git, e.g. hashes inside commit messages that reference an earlier commit (and of course signatures). Those might require a `git filter-repo`-like step that breaks the SHA-1 hashes (and signatures) in order to migrate to SHA-2, if that is desired.
SHA-1 was already known to be broken at the time Git chose it, but they chose it anyway. Choosing a non-broken algorithm like SHA-2 was an easy choice they could have made that would still hold up today. Implementing a crypto agility system is not without major trade-offs (consider how common downgrade attacks have been across protocols!).
> Choosing a non-broken algorithm like SHA-2 was an easy choice they could have made that would still hold up today.
Yet the requirement Git places on its hashing algorithm is not broken: that requirement is not cryptographic but merely stochastic, and Linus knows this.
Why bother to produce a collision, when you have the power to get your changes pulled into a release branch? Your attack might be noticed, and your cover blown.
Instead, simply try to get a bug merged that results in a zero day. In case somebody discovers it, at least you have plausible deniability that it happened by accident.
Since about 2005, collision attacks against SHA-1 have been known. In 2005 Linus dismissed these concerns as impractical, writing:
> The basic attack goes like this:
>
> - I construct two .c files with identical hashes.
> Ok, I have a better plan.
>
> - you learn to fly by flapping your arms fast enough
> - you then learn to pee burning gasoline
> - then, you fly around New York, setting everybody you see on fire, until people make you emperor.
>
> Sounds like a good plan, no?
>
> But perhaps slightly impractical.
>
> Now, let's go back to your plan. Why do you think your plan is any better than mine?
This is a really good example of Torvalds' toxic attitude and his absolutely horrific approach to security. It's a recurring pattern, unfortunately.
Git not being prepared for this is going to cost a lot of time and money for a very large number of people, and it could have been trivially mitigated if security had been taken seriously in the first place, and if Torvalds were mature enough to understand that he is not an expert on cryptography.
I didn't know either. From Wikipedia [1], SHA-1 has been considered insecure to some degree since 2005. Following the citations, apparently it's been known since at least August 2004 [2] but maybe not demonstrated in SHA-1 until early 2005.
git's first release was in 2005, so I guess technically SHA-1 issues could've been known or suspected during development time.
More generously, it could've been somewhat simultaneous. It sounds like it was considered a state-sponsored level attack at the time, if collisions were even going to be possible. Don't know if the git devs knew this and intentionally chose it anyway, or just didn't know.
If there's a readily available blob of C code that does the operation, then by definition it must be described somewhere. Maybe you should get ChatGPT to describe what it does.
> SHA-1 is still perfectly fine for some applications like detecting duplicate files on a storage medium
If by “perfectly fine” you mean “subject to attacks that generate somewhat targeted collisions that are practical enough that people do them for amusement and excuses to write blog posts and cute Twitter threads”, then maybe I agree.
Snark aside, SHA-1 is not fine for deduplication in any context where an attacker could control any inputs. Do not use it for new designs. Try to get rid of it in old designs.
By “perfectly fine” they mean detecting duplicate image or document files on your local storage, which it’s still perfectly fine for, and a frequent mode of usage for these types of tools.
Not every tool needs to be completely resilient to an entire Internet's worth of attacks.
Deduplication is the kind of application where CRC is a decent approach and CRC has no resistance to attack whatsoever. SHA1 adds the advantage of lower natural collision probability while still being extremely fast. It's important to understand that not all applications of hashing are cryptographic or security applications, but that the high degree of optimization put into cryptographic algorithms often makes them a convenient choice in these situations.
These types of applications are usually using a cryptographic hash as one of a set of comparison functions that often start with file size as an optimization and might even include perceptual methods that are intentionally likely to produce collisions. Some will perform a byte-by-byte comparison as a final test, although just from a performance perspective this probably isn't worth the marginal improvement even for hash functions in which collisions are known to occur but vanishingly rare in organic data sets (this would include for example MD5 or even CRC at long bit lengths, but the lack of mixing in CRC makes organic collisions much more common with structured data).
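For concreteness, a minimal Python sketch of that cascade (group by size, then by SHA-1 digest, then confirm byte-for-byte before trusting a match); the function name and the 1 MB chunk size are my own illustrations, not taken from any particular tool:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Cheap checks first: group by size, then by SHA-1 digest,
    then confirm candidates byte-for-byte before reporting them."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size can't be a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)
        for candidates in by_digest.values():
            if len(candidates) > 1:
                # byte-for-byte confirmation; fine for a sketch, but very
                # large files would want a streaming comparison instead
                first = candidates[0].read_bytes()
                same = [p for p in candidates[1:] if p.read_bytes() == first]
                if same:
                    groups.append([candidates[0], *same])
    return groups
```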
SHA2 is significantly slower than SHA1 on many real platforms, so given that intentional collisions are not really part of the problem space few users would opt for the "upgrade" to SHA2. SHA1 itself isn't really a great choice because there are faster options with similar resistance to accidental collisions and worse resistance to intentional ones, but they're a lot less commonly known than the major cryptographic algorithms. Much of the literature on them is in the context of data structures and caching so the bit-lengths tend to be relatively small in that more collision-tolerant application and it's not always super clear how well they will perform at longer bit lengths (when capable).
Another way to consider this is from a threat modeling perspective: in a common file deduplication operation, when files come from non-trusted sources, someone might be able to exploit a second-preimage attack to generate a file that the deduplication tool will errantly consider a duplicate with another file, possibly resulting in one of the two being deleted if the tool takes automatic action. SHA1 actually remains highly resistant to preimage and second preimage attacks, so it's not likely that this is even feasible. SHA1 does have known collision attacks but these are unlikely to have any ramifications on a file deduplication system since both files would have to be generated by the adversary - that is, they can't modify the organic data set that they did not produce. I'm sure you could come up with an attack scenario that's feasible with SHA1 but I don't think it's one that would occur in reality. In any case, these types of tools are not generally being presented as resistant to malicious inputs.
If you're working in this problem space, a good thing to consider is hashing only subsets of the file contents, from multiple offsets to avoid collisions induced by structured parts of the format. This avoids the need to read in the entire file for the initial hash-matching heuristic. Some commercial tools initially perform comparisons on only the beginning of the file (e.g. first MB) but for some types of files this is going to be a lot more collision prone than if you incorporate samples from regular intervals, e.g. skipping over every so many storage blocks.
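A sketch of what that sampling might look like; the sample size, the sample count, and the choice to mix the file length into the hash are arbitrary illustrations, and a matching sampled digest still needs a full comparison before anything gets deleted:

```python
import hashlib
import os

def sampled_digest(path, sample_size=64 * 1024, samples=8):
    """Hash the file length plus fixed-size samples taken at evenly
    spaced offsets, instead of reading the whole file."""
    size = os.path.getsize(path)
    h = hashlib.sha1(str(size).encode())  # length is a cheap extra discriminator
    with open(path, "rb") as f:
        if size <= sample_size * samples:
            h.update(f.read())            # small file: just hash all of it
        else:
            step = (size - sample_size) // (samples - 1)
            for i in range(samples):
                f.seek(i * step)
                h.update(f.read(sample_size))
    return h.hexdigest()
```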
who is attacking you in this situation though? you're scanning the files on your local system and storing their hashes. you then look for duplicate hashes, and compare the files that created them. if the files are truly duplicates, you can now decide what to do about that. if they are not truly the same, then you claim to have found another case of collisions, write your blog post/twitthread and move on, but does that constitute being attacked?
sometimes, i really feel like people in crypto just can't detach themselves enough to see that just because they have a hammer, not everything in the world is a nail.
Why would you pick a function that is known to have issues when there are other functions that do the same thing but don't have known issues?
Your comparison is flawed. It's more like if you have a nail and next to it a workbench with two hammers - a good hammer and a not as good hammer. This isn't a hard choice. But for reasons that are unclear to me, people in this thread are insisting on picking the less good hammer and rationalizing why for this specific nail it isn't all that much worse. Just pick the better hammer!
Because people already have two decades of SHA-1 hashes in their database and a rewrite + rescan is completely pointless? Hell, I have such a system using md5. So you produced a hash collision, cool, now fool my follow-on byte-by-byte comparison.
Edit: Before anyone lectures me on SHA-1 being slow: yes, I use BLAKE2 for new projects.
You could just discard half the sha256 hash. Using the first 16 bytes of sha256 is a lot more secure than using md5; and if md5 is acceptable, you might as well just use crc32.
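In case it isn't obvious, "discard half" really is just keeping a prefix of the digest; a trivial sketch (the 16-byte length is chosen only to match MD5's output size):

```python
import hashlib

def short_digest(data: bytes) -> bytes:
    # 128-bit value, same storage footprint as an MD5 digest,
    # but with SHA-256's far better resistance behind it
    return hashlib.sha256(data).digest()[:16]
```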
Your question is irrelevant. If you don't care about security, SHA1 is a bad choice because there are faster hash functions out there. If you do care about security, SHA1 is a bad choice because it has known flaws and there exist other algorithms that don't. The only valid reason to use SHA1 is if there is a historical requirement to use it that you can't reasonably change.
Any analysis about how hard it is for an attacker to get a file onto your local file system via a cloned git repo, cached file, email attachment, image download, shared drive, etc. is just a distraction.
I don't think this is what most people think of when they say deduplication. There are quite a few systems which will just scan for duplicates and then automatically delete one of them. In such a system sha1 would be inappropriate.
If you are just using sha1 as a heuristic you don't fully trust, I suppose it's fine. It seems a bit of an odd choice though, as something like MurmurHash would be much faster for such a use case.
Most people. I haven't been part of that group, for like, ever maybe?
If we're a group of devs with a not insignificant percentage of frontend/UI/UX types, then having the same image in multiple sizes, formats, etc. is going to be pretty common. Looking for exact copies of a file only reduces so much. Knowing you have a library of images with a source plus all of its derivatives gets you a lot fewer files, as long as you know which one is the source; at that point, running image-based sameness checks is much more beneficial. Sure, this is niche territory, but yeah, and, so?
Maybe there's someone new(-ish) that hasn't really had to deal with cleaning up thousands of images to this extent. One would hope the same image in its various forms within a dev's env would be similarly named, but that's not guaranteed. If we could depend on filenames, we wouldn't need hashing, right?
In some cases deduplication happens at the file system layer transparently without you even realizing it. E.g. there are tools like https://github.com/lakshmipathi/dduper
I agree that image editing workflows are a different use case more suited to perceptual hashes than cryptographic hashes.
SHA-1 cannot be trusted only when there is a possibility that both files whose hashes are compared have been created by an attacker.
While such a scenario may be plausible for a public file repository, so SHA-1 is a bad choice for a version control system like Git, there are a lot of applications where this is impossible, so it is fine to use SHA-1 there.
I'm not sure what scenarios there are where you have a possibility of the attacker creating 1 file but not both. Especially because the attacker doesn't need to fully control both files but could control only a prefix of one of them and still do the attack.
I also think working out all the possibilities is really hard, and using sha256 is really easy.
SHA-1 is implemented in hardware in all modern CPUs and it is much faster than any alternatives (not all libraries use the hardware instructions, so many popular programs compute SHA-1 much more slowly than possible; OpenSSL is among the few that use the hardware).
When hashing hundreds of GB or many TB of data, the hash speed is important.
When there are no active attackers and even against certain kinds of active attacks, SHA-1 remains secure.
For example, if hashes of the files from a file system are stored separately, in a secure place inaccessible for attackers (or in the case of a file transfer the hashes are transferred separately, through a secure channel), an attacker cannot make a file modification that would not be detected by recomputing the hashes.
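As a sketch of that setup (the names and the choice of SHA-1 here are illustrative; the important part is that the manifest is kept somewhere the attacker can't reach):

```python
import hashlib
from pathlib import Path

def file_digest(path, algo="sha1"):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map relative path -> digest; store the result out of band
    (offline media, another host, a signed file, ...)."""
    root = Path(root)
    return {str(p.relative_to(root)): file_digest(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify(root, manifest):
    """Recompute digests and return the paths whose contents changed.
    (A deleted file raises, which is also a detection of sorts.)"""
    return [name for name, digest in manifest.items()
            if file_digest(Path(root) / name) != digest]
```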
Even if SHA-1 remains secure against preimage attacks, it should normally be used only when there are no attackers, e.g. for detecting hardware errors a.k.a. bit rotting, or for detecting duplicate data in storage that could not be accessed by an attacker.
While BLAKE3 (not BLAKE2) can be much faster than SHA-1, all the extra speed is obtained by consuming proportionally more CPU resources (extra threads and SIMD). When the hashing is done in the background, there is no gain from using BLAKE3 instead of SHA-1, because the foreground tasks will be delayed by the time gained for hashing.
BLAKE3 is the best choice only when a computer is doing nothing but hashing, because then the hash will be computed in minimal time by fully using all the CPU cores.
> all the extra speed is obtained by consuming proportionally more CPU resources (extra threads and SIMD)
If you know you have other threads that need to do work, then yes, multithreading BLAKE3 would just pointlessly compete with those other threads. But I don't think the same is true of SIMD. If your process/thread isn't using vector registers, it's not like some other thread can borrow them. They just sit idle. So if you can make use of them to speed up your own process, there's very little downside. AVX-512 downclocking is the most notable exception, and you'd need to benchmark your application to see whether / how much that hurts you. But I think in most other cases, any power draw penalty you pay for using SIMD is swamped by the race-to-idle upside. (I don't have much experience measuring power, though, and I'd be happy to get corrected by someone who knows more.)
DO NOT USE SHA-1 UNLESS IT’S FOR COMPATIBILITY. NO EXCUSES.
With that out of the way: SHA-1 is not even particularly fast. BLAKE2-family functions are faster. Quite a few modern hash functions are also parallelizable, and SHA-1 is not. If for some reason you need something faster than a fast modern hash, there are non-cryptographic hashes and checksums that are extraordinarily fast.
If you have several TB of files, and for some reason you use SHA-1 to dedupe them, and you later forget you did that and download one of the many pairs of amusing SHA-1 collisions, you will lose data. Stop making excuses.
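For anyone who wants numbers rather than assertions, a quick-and-dirty timing sketch; results swing a lot depending on the CPU and on whether the OpenSSL build behind `hashlib` uses hardware SHA instructions, so run it on your own hardware:

```python
import hashlib
import os
import time

data = os.urandom(256 * 1024 * 1024)  # 256 MB of random input

for name in ("md5", "sha1", "sha256", "blake2b"):
    h = hashlib.new(name)
    start = time.perf_counter()
    h.update(data)
    h.hexdigest()
    print(f"{name:8s} {time.perf_counter() - start:.2f}s")
```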
> there are non-cryptographic hashes and checksums that are extraordinarily fast.
Is it still true that CRC32 is only about twice as fast as SHA1?
Yeah, I know the xxHash family is something like 30 times faster than SHA1.
A lot depends on instruction set and processor choice.
Maybe another way to put it is I've always been impressed that on small systems SHA1 is enormously longer but only twice as slow as CRC32.
For a lot of interoperability-maxing, non-security, non-crypto tasks, CRC32 is not a bad choice; if it's good enough for Ethernet, zmodem, and MPEG streams, it's good enough for my telemetry packets LOL. (IIRC iSCSI uses a variant with a different polynomial.)
For files, it is useless. Even though that was to be expected, I computed CRC32 for all the files on an SSD and, of course, found thousands of collisions.
Birthday-style collisions don't matter for integrity checking.
32 bits is too small to do the entire job of duplicate detection, but if it's fast enough then you can add a more thorough second pass and still save time.
I believe the GP's point hinges on the word "attacker". If you aren't in a hostile space, like your own file server where you are monitoring your own backups, it's fine. I still use MD5s to version my own config files. For personal use in non-hostile environments these hashes are still perfectly fine.
> SHA-1 is still perfectly fine for some applications like detecting duplicate files on a storage medium
That's what the developers of subversion thought, but they didn't anticipate that once colliding files were available people would commit them to SVN repos as test cases. And then everything broke: https://www.bleepingcomputer.com/news/security/sha1-collisio...
That changes the parameters quite a bit though. For local digests, like image deduplication of your own content, on your own computers, sha-1 is still perfectly fine. Heck, even MD5 is still workable (although more prone to collide). Nowhere in that process is the internet, or "users" or anything else like that involved =)
You use digests to quickly detect potential collisions, then you verify each collision report, then you delete the actual duplicates. Human involvement still very much required because you're curating your own data.
If we're talking specifically image deduplication, then a hash comparison is only going to find exact matches. What about deduplicating alt versions of things like different scalings, different codecs, etc.?
If you want to dedupe images, some sort of perceptual hashing would be much better, so that the actual image content is compared rather than just the specific bits used to encode it.
Depends on the images. For photographs, a digest is enough. For "random images downloaded from the web", or when you're deduplicating lots of users' data, sure, you want data-appropriate digests, like SIFT prints. But then we're back to "you had to bring network content back into this" =)
Very true but if the hash matches the images are guaranteed to match too. That's my first pass when deduping my drives. My second pass is looking for "logically equivalent" images.
If you need a shorter hash just truncate a modern hash algorithm down to 160 or 128 bits. Obviously the standard lengths were chosen for a reason, but SHA2-256/160 or SHA2-256/128 are better hash functions than SHA1 or MD5, respectively. Blake2b/160 is even faster than SHA1!
(I suspect this would be a good compromise for git, since so much tooling assumes a 160 bit hash, and yet we don't want to continue using SHA1)
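Both options are one-liners with Python's `hashlib`, for anyone curious (purely illustrative, not a claim about what git should do):

```python
import hashlib

msg = b"example input"

# "SHA2-256/160": keep the first 20 bytes of a SHA-256 digest
sha256_160 = hashlib.sha256(msg).digest()[:20]

# BLAKE2b natively parameterized to a 160-bit output
blake2b_160 = hashlib.blake2b(msg, digest_size=20).digest()

print(sha256_160.hex(), blake2b_160.hex())
```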
Just as a note, the primary reason for the truncated variants is not to get a shorter hash but to prevent length-extension attacks. For variants without truncation, the final hash is the entire internal state, so an attacker who knows the hash of a message can calculate the hash of any message that starts with the original message and appends additional content, without knowing the original message itself. Truncating the hash denies the attacker the complete internal state and makes this impossible.
Another way to prevent extension attacks is to make the internal state different whenever the current block is the last block, as done for instance in BLAKE3 (which has as an additional input on each block a set of flags, and one of the flags says "this is the last block").
Git has already implemented a solution based on SHA-2 with 256 bit output so that's unlikely to be changed for the time being. (But it has not really been launched in earnest, only as a preview feature.)
As an industry we need to get over this pattern of scoping down usage of something that has failed its prime directive. People still use MD5 in security-related things because it's been allowed to stick around without huge deprecation warnings in libraries and tools.
SHA1 (and MD5) need to be treated the same way you would treat O(n^2) sorting in a code review for a PR written by a newbie.
“We recommend that anyone relying on SHA-1 for security migrate to SHA-2 or SHA-3 as soon as possible.” —Chris Celi, NIST computer scientist
The emphasis being on "for security"
I've also used SHA-1 over the years for binning and verifying file transfer success, none of those are security related.
Sometimes, if you make a great big pile of different systems, what's held in common across them can be weird, SHA-1 popped out of the list so we used it.
I'm well aware it's possible to write, or automate the writing of, dedicated specialized "perfect" hashing algos to match the incoming data, to bin the data more perfectlyier, but sometimes it's nice if wildly separate systems all bin incoming data the same highly predictable way that's "good enough" and "fast enough".
Verified as in "is this file completely transferred or not?"
non-security-critical data; I just want a general idea of whether it's valid, or whether the file transfer failed halfway through, or the thing sending it went bonkers and just sent us trash.
Another funny file transfer use: send me a file of data every hour. Is the non-crypto hash new, or the same old hash? If it's the same old hash, those clowns sent me the same file twice when I'm supposed to get a new one. Yes, I know I can dedupe "easily", but not as easily as with SHA-1. And some application-layer software like MySQL can generate SHA1 directly as a function in a query. It's really quite handy sometimes!
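A sketch of that last trick (the paths are made up; `hashlib.file_digest` needs Python 3.11+, older versions can hash in chunks instead):

```python
import hashlib

def digest_of(path):
    with open(path, "rb") as f:
        return hashlib.file_digest(f, "sha1").hexdigest()

# hypothetical hourly delivery paths
if digest_of("deliveries/this_hour.dat") == digest_of("deliveries/last_hour.dat"):
    print("same file delivered twice; go yell at the clowns upstream")
```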
> SHA-1 is still perfectly fine for some applications like detecting duplicate files on a storage medium
Absolutely agree, especially when speed is a workable trade-off and accepting real world hash collisions are unlikely and perhaps an acceptable risk. For financial data, especially files not belonging to me I would have md5+sha1+sha256 checksums and maybe even GPG sign a manifest of the checksums ... because why not. For my own files md5 has always been sufficient. I have yet to run into a real world collision.
FWIW anyone using `rsync --checksum` is still using md5. Not that long ago (2014, I think) it was using md4. I would be surprised if rsync started using anything beyond md5 any time soon. I would love to see all the checksum algorithms become CPU instruction sets.
```
Optimizations:
    no SIMD-roll, no asm-roll, no openssl-crypto, asm-MD5
Checksum list:
    md5 md4 none
Compress list:
    zstd lz4 zlibx zlib none
Daemon auth list:
    md5 md4
```
Even if git didn't have protection against the known attack, it's still safe in practice.
The SHA-1 collision attack can only work if you take a specially-crafted file from the attacker and commit it to your repository. The file needs to have a specific structure, and will contain binary data that looks like junk. It can't look like innocent source code. If you execute unintelligible binary blobs from strangers, you're in trouble anyway.
There is no preimage weakness in SHA-1, so nobody is able to change or inject new data to an arbitrary repo/commit that doesn't already contain their colliding file.
I don't think so, unless you utilize some not-yet-public vulnerability.
As far as I know, with current public SHA-1 vulnerabilities, you can create two new objects with the same hash (collision attack), but cannot create a second object that has the same hash as some already existing object (preimage attack).
My bad, yep you're right. So you could only either give 2 people different git repos that should be the same or I guess you could submit a collided file into a repo you can submit changes to (eg a public one that accepts PRs) and give someone else the other version.
I feel like the properties of CRC make them superior for that task in most cases though. (CRC8, CRC16, CRC32 and CRC64, maybe CRC128 if anyone ever bothered going that far)
In particular, CRC guarantees detection on all bursts of the given length. CRC32 protects vs all bursts of length 32 bits.
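A self-contained way to poke at that guarantee (this only exercises it, it doesn't prove it; every bit inside a random window of at most 32 bits gets flipped, which is one kind of burst):

```python
import os
import random
import zlib

data = os.urandom(4096)
reference = zlib.crc32(data)

# Corrupt a burst no longer than the CRC width and check that CRC32 notices.
for _ in range(10_000):
    burst_len = random.randrange(1, 33)                  # 1..32 bits
    start = random.randrange(len(data) * 8 - burst_len)
    corrupted = bytearray(data)
    for bit in range(start, start + burst_len):
        corrupted[bit // 8] ^= 1 << (bit % 8)
    assert zlib.crc32(corrupted) != reference
print("every burst was detected")
```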
> I feel like the properties of CRC make them superior for that task in most cases though.
THIS IS FALSE. Please do not ever do this. Why not? For example, by controlling any four contiguous bytes in a file, the resultant 32bit CRC can be forced to take on any value. A CRC is meant to detect errors due to noise - not changes due to a malicious actor.
Program design should not be done based upon one's feelings. CRCs absolutely do not have the required properties to detect duplication or to preserve integrity of a stored file that an attacker can modify.
> THIS IS FALSE. Please do not ever do this. Why not? For example, by controlling any four contiguous bytes in a file, the resultant 32bit CRC can be forced to take on any value.
And SHA1 is now broken like this, with collisions and so forth. Perhaps it's not as simple as just 4 bytes, but the ability to create collisions is forcing this retirement.
If adversarial collisions are an issue, then MD5 and SHA1 are fully obsolete now. If you don't care for an adversary, might as well use the cheaper, faster CRC check.
------
CRC is now a more valid choice than SHA1 for these use cases. That's the point of this announcement.
> The point of the announcement is to give a timeline to start treating SHA1 as having no real security.
That's also false. There is a large body of knowledge here that you aren't expressing in your comments. That leads me to see that you are unfamiliar with the purposes of hash functions and their utility in real world situations.
The announcement refers to the transition timeline to stop using SHA-1, preferring the SHA-2 and SHA-3 families. However, the recommendation from NIST for years has been not to use SHA-1. For example, SP 800-131Ar2 says not to use SHA-1 for digital signature generation, and that using it for digital signature verification is acceptable only for legacy purposes.
The recommendation would have been for years to not use SHA-1 at all, except for this carve-out to handle already stored data that uses SHA-1. The remaining use cases cover protocol use, such as TLS, where SHA-1 is used as a component in constructs and not solely as a primitive.
There are only a few examples of anything larger than CRC64 being characterized and they're not very useful.
For the sake of the next person who has to maintain your code though, please choose algorithms that adequately communicate your intentions. Choose CRCs only if you need to detect random errors in a noisy channel with a small number of bits and use a length appropriate to the intended usage (i.e. almost certainly not CRC64).
And when you choose SHA1, does it mean you understood that it's no longer secure? Or is it chosen because it was secure 20 years ago but the code is old and needs to be updated?
CRC says that you never intended security from the start. It's timeless, aimed to prevent burst errors and random errors.
--------
BTW, what is the guaranteed Hamming distance for SHA1? How good is SHA1 against burst errors? What about random errors?
Because the Hamming distances of CRC have been calculated and analyzed. We actually can determine, to an exact level, how good CRC codes are.
If you are choosing between a CRC and SHA1, you probably need to reconsider your understanding of the problem you are trying to solve. Those algorithms solve different use cases.
If you are choosing SHA1, now that it is retired, you probably should think harder about the problem in general.
CRC should be better for any error detection code issue. Faster to calculate, more studied guaranteed detection modes, and so forth.
SHA1 has no error detection studies. It's designed as a cryptographic hash, to look random. As it so happens, it is more efficient to use other algorithms and do better than random if you have a better idea of what your errors look like.
Real world errors are either random or bursty. CRC is designed for these cases. CRC detects the longest burst possible for its bit size.
You shouldn't choose SHA-1, that's the point of this announcement. Seeing it indicates both that there was the potential for malicious input and that the code is old. The appropriate mitigation is to move to a secure hash, not CRCs. You may not know the bounds and distances exactly, but you know them probabilistically. Bit errors almost always map to a different hash.
The same is true of CRCs over a large enough input as an aside.
Basically, after ~10 rounds the output is always indistinguishable from randomness, which means the Hamming distance is what you'd expect (about half the bits differ) between the hashes of any two bitstreams.
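Easy to check empirically; a toy sketch (the input string is arbitrary), which averages out to roughly 80 differing bits of SHA-1's 160:

```python
import hashlib

def hamming(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

msg = bytearray(b"any old bitstream will do")
baseline = hashlib.sha1(msg).digest()

# Flip each input bit in turn and measure how far the digest moves.
distances = []
for byte in range(len(msg)):
    for bit in range(8):
        msg[byte] ^= 1 << bit
        distances.append(hamming(baseline, hashlib.sha1(msg).digest()))
        msg[byte] ^= 1 << bit   # undo the flip
print(sum(distances) / len(distances))   # hovers around 80 of 160 bits
```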