> "Intel doesn't have confidence in the drive at that point, so the 335 Series is designed to shift into read-only mode and then to brick itself when the power is cycled."
I don't understand why Intel wouldn't just configure these drives to go into read-only mode permanently. If I realized my hard drive had become read-only and didn't suspect hard drive failure, my first inclination would be to reboot my computer, not immediately back up all data.
The article is wrong on this point, and on Intel's intentions, as far as I can tell. Intel has a "Supernova" feature (http://itpeernetwork.intel.com/data-integrity-in-solid-state...) which will cause some drive models to brick themselves if certain conditions are met - errors in the control path, for example, which basically mean you cannot trust the drive at all. The supernova feature is only claimed for enterprise drives, and the 335 series is not an enterprise drive.
I have a lot of experience with long-running Intel SSDs of various models, including pushing them to the same kinds of extreme that the SSD endurance experiment did, and I have never observed them to self-brick simply because they reached their flash endurance point.
What I have observed is a number of firmware bugs (or possibly just the supernova feature) that caused the drive to brick on power cycle, even for drives in perfect health.
I liked the SSD endurance articles, because they went a long way to allaying fears about SSDs, but I think it's a shame they've left this point in.
My data point: I got an Intel SSD back in 2010 (one of the first affordable ones, $500 for 160 GB) and it started showing bad sectors in 2013. I immediately copied all my data off and sent it to Intel, who sent me a replacement for free. The replacement has been working fine ever since.
I don't know Intel's reasoning for this policy, but if there's a sound technical reason for it, I would guess that it has to do with the drive not wanting to flush its NAND mapping information from DRAM to flash that it has deemed worn out. However, the Intel 335 Series uses SandForce controllers that don't have an external DRAM buffer, so they never have much data cached or in flight. It's more likely this policy was decided upon for enterprise products and was deemed not worth revising for client products given how few customers would exhaust the drive's write endurance to be affected by this.
EDIT: And, as pointed out by cuchulain, much of the information about the intended end-of-life behavior of Intel's SSDs is unreliable; they don't publish that information on a per-model basis, so some of what you read is based on mere speculation.
I remember having a contentious discussion at work about the design of an emulated EEPROM driver for an embedded product. The flash memory hardware was rated for a certain number of erase/write cycles, and the question was what we should do once the cycle count exceeded that rating. I said we should keep functioning and eventually raise a warning or something, but my colleagues said we should simply kill the hardware and brick ourselves. I was adamantly against this, but they cited safety concerns: maybe the flash could get corrupted, and we weren't supposed to support that long a lifetime anyway. We had a lot of safety mechanisms and redundancies baked in, so data corruption would not happen. My argument was mainly that, yes, if unrecoverable data corruption happens, brick it, but until the hardware forces you to close up shop, the software should keep running as long as possible. I don't know what they ended up implementing because I soon left, but I think they went with the self-bricking option.
Anyway, I just wanted to share this anecdote, and I can't help but think that maybe somewhere, some Intel engineers had a discussion very similar to my own.
When I read that, I also thought "That's horrible; guess I won't buy that drive." Reading further, though, I discovered that all the drives in his test became unreadable ("bricked"?) when they eventually failed.
Well, for the others, if you really care you can see sectors starting to get remapped and think "ah, OK, time to start backing this data up and replacing the drive," whereas if I understand correctly, on the Intel one you pretty much immediately need to back up the data and hope you don't need to restart or lose power before you've backed up what you need.
I'm sure it's more nuanced than that but my reaction was definitely "steer clear of the Intel drives ..." when I read this so perhaps someone can clarify.
Backup shouldn't be something you do when the drive is exhaling its last breath, or even when it shows its first symptoms; it should be done often and transparently. On a laptop the best practice is to arrange a sync with a server (NAS, etc.) when you get home. Done incrementally, it takes seconds to minutes and is fully automatic. Unfortunately, making backups still isn't common practice; most users see a NAS or even an external drive as wasted money. They feel safe "backing up" some data on a USB key, only to discover how volatile and unsafe it can be when it's too late (breakage, washing machine, theft, loss, etc.)
Continuous Integration systems can really burn through SSD endurance. If you have a large, compiled code base which rebuilds on every checkin, you will be creating and deleting object code constantly. Use smartmontools or HDD Guardian to keep an eye on endurance.
Our code base creates around half a gig of compilation product on every build. We used up the endurance on a consumer-level Micron SSD in about a year. No data loss occurred.
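A minimal sketch of how you might automate that watch with smartmontools: parse `smartctl -A` output for a wear attribute and alert when it drops. The sample output and the `Media_Wearout_Indicator` attribute name below are illustrative only; attribute names and columns vary by vendor and model.

```python
# Illustrative sample of `smartctl -A /dev/sda` output; real attribute
# names and columns differ between vendors and models.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   0
233 Media_Wearout_Indicator 0x0032   097   097   000    Old_age   0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   1048576
"""

def attribute_value(smart_output, name):
    """Return the normalized VALUE column for a named SMART attribute, or None."""
    for line in smart_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == name:
            return int(fields[3])
    return None

# In a cron job you would feed in the output of
# subprocess.run(["smartctl", "-A", "/dev/sda"], ...) instead of SAMPLE,
# and alert when the wear indicator falls below some floor you choose.
wear = attribute_value(SAMPLE, "Media_Wearout_Indicator")
print(wear)  # 97
```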
Indeed, my Ubuntu install recently created multiple 22 GB log files several days in a row (some USB issue or other, fixed by updating the kernel). It wouldn't have been an issue, but the disk was nearly full.
Maybe there's room for a 'file write filter' that avoids writing identical data back to the same file. To save SSD lifetime. Sounds like it would have application.
There's also something to be said for having a build system that can correctly do incremental rebuilds and caching of outputs, which could massively ease the SSD write load.
You guys are all adorable software devs :-). You're trying to solve a problem with software that just isn't that big a deal: high endurance drives exist, or just buy a new one every year.
That's cute. Require maintenance where none is actually needed. Everyone hosting a CI now needs a hardware guy, too.
Bandaid solutions (replace every x months, buy something bigger/faster, etc) are not the way to go. The hardware solution to this is not buy a high-endurance drive but to buy more RAM and set up a tmpfs build directory - or buy a ram drive and use that for build instead of you want to eliminate even that software configuration step.
There is no "solution" to speak of: tmpfs can be mounted on any directory on Linux, so there's no difference from the normal build process, at least on Linux. I can also, for example, serialize build jobs in Jenkins, so the peak space required is spread out over time.
RAM is more expensive than SSDs but it's still very cheap at roughly $5 per GB. Higher density 32GB DIMMs cost only a bit more at $7 per GB. You can buy terabytes of RAM for a few thousand dollars.
Well, yeah. Reproducible builds are valuable. If you don't think so, then have fun trying to rollback to a commit that can only be built by building a specific set of commits leading up to it in the right order.
These are all just ratings though. The theory is that over a population of drives, you'll see a higher failure rate than predicted if you do higher than the rated workload per year. WDC used to have a whitepaper on it called "Why Specify Workload", but it's no longer on their site.
I have in some cases seen enterprise sata drives pushed to the kinds of workload you're talking about - 2.5PB in a year - and seen in the order of 10% fail over that time, with a drive that normally has a ~0.5% AFR.
80 MB/s of sequential reads or writes is probably something consumer HDDs can survive for several years. The platters are always spinning; the only difference is that now the drive is continuously reading or writing what's under the head. It's the random accesses (and associated seeks) which stress them.
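The back-of-the-envelope arithmetic linking that 80 MB/s figure to the ~2.5 PB/year workload mentioned above:

```python
# Sustained 80 MB/s, around the clock, for one year
rate_bytes_per_s = 80e6
seconds_per_year = 365 * 24 * 3600           # 31,536,000 s
total_bytes = rate_bytes_per_s * seconds_per_year
print(total_bytes / 1e15)  # ~2.52 PB in a year
```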
There are various comparisons out there which conclude "datacenter-grade" is largely a marketing/warranty thing; the drives themselves may be nearly identical in design.
I modified some programs of mine that generate a lot of files to read the old version of the file first, compare it with the new version in the buffer, and only write out the new file if it is actually different. This cuts way down on the write cycles to the SSD. It's faster, too!
Why do the SSDs all brick themselves when this happens? It seems like a huge mis-feature; HDDs are almost never recoverable when they fail but if you can't reallocate blocks on flash just go into read-only mode.
In principle, going into read-only mode should work and it should take a while for read disturb errors to corrupt the data. But there's a trade-off that if you're trying to keep servicing writes as long as possible (and retiring bad blocks as they wear out), the risk rises that an earlier-than-expected unrecoverable error will corrupt the critical data structures that keep track of the mapping between logical and physical addresses. Playing it safe means quitting early and thus giving your drive an endurance rating that suggests it is less reliable than the competition.
And it's no surprise that the aspects of SSD firmware that by nature get the least real-world testing and are the most tricky to design would be quite buggy in practice. Even ZFS doesn't try to avoid catastrophic data loss in the face of unreliable RAM.
The only trouble is, I have subscription overload. Every newspaper and their dog wants to sell me a subscription, but I generally don't read newspapers daily.
I'd love to have access to this data through some spotify-for-text service or Blendle or something though.
I guess I'm not alone in wanting to pay researchers, bloggers, journalists, etc., based on what I read, rather than on a monthly subscription to every company I ever want to read something from?
I keep my recurring subscriptions to a minimum too, so I understand. But, funding the procurement of statistically significant numbers of multiple models of SSD drives, running them through to end-of-life characterization, and keeping that all updated as new models come out is a higher spending profile than your typical blogger. It seems more like a business research report or recurring lab test type of service.
Maybe Flattr [0] is close enough? You set a monthly budget, pick things to support over the course of the month, and at the end of the month those things automatically get their slice of your pre-set budget.
I loved this series. It inspired us to do similar experiments with SSDs as we were spec'ing out new servers. I highly recommend doing this so you get a feel for what SMART looks like for your specific SSDs. It's nice to be able to monitor that and have some idea when your SSDs are going to die, especially if most of your drives are aging together.
If you divide the total data written at the point where reallocated sectors start appearing by the drive's capacity, you can figure out the actual average endurance of the flash, in program/erase cycles:

 ~400 cycles   Samsung 840 Series
~2344 cycles   Samsung 840 Pro
~2400 cycles   Kingston HyperX 3K
~2800 cycles   Intel 335 Series
~4400 cycles   Corsair Neutron GTX