There are only four billion floats, so test them all (2014) (randomascii.wordpress.com)
501 points by tape_measure on Oct 11, 2020 | 175 comments


Many years ago now we built something that used a 60-bit truncated hash of URIs. So that's far too small to be comfortable that you won't run into collisions, but it's big enough that for modest sizes (hundreds of millions up to billions) a collision is unlikely. So you can have a slow path for the collision case, so long as that slow path is correct. For unit testing we needed at least one collision so we could see that everything behaves as designed before some real user hits one. We had test data sets with over a billion URIs in them, but none that (as far as we could tell) had a collision.

So I just wrote code to mint nonsense URIs for our system from a small integer seed (like https://foo.example/123456789), then spun up a machine with a decent amount of RAM (maybe 8-12GB) to iterate through each seed, storing the hash and the seed it was generated from, until it hit a collision after filling a few gigabytes of RAM, at which point it spat out the two URIs minted from the colliding seeds. Bingo: now we had data for unit testing to show that stuff works even in the unlikely case that two distinct URIs have the same hash.
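A minimal sketch of that search in Python — the URI scheme matches the example above, but the hash function and the much smaller 24-bit truncation are stand-ins so a collision appears in seconds rather than gigabytes:

```python
import hashlib

def truncated_hash(uri: str, bits: int = 24) -> int:
    """Hash a URI and keep only the low `bits` bits (60 in the real system)."""
    digest = hashlib.blake2b(uri.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

def find_collision(bits: int = 24):
    """Mint nonsense URIs from integer seeds until two share a truncated hash."""
    seen = {}  # truncated hash -> first seed that produced it
    seed = 0
    while True:
        uri = f"https://foo.example/{seed}"
        h = truncated_hash(uri, bits)
        if h in seen:
            return f"https://foo.example/{seen[h]}", uri
        seen[h] = seed
        seed += 1

a, b = find_collision()
assert a != b and truncated_hash(a) == truncated_hash(b)
```

By the birthday bound, the 24-bit toy version collides after a few thousand seeds; the 60-bit real thing needed roughly 2^30.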

Probably a day's work to build that, and then I'd guess several hours every month saved debugging weird collision bugs because our unit tests would now tell us if we'd screwed up and collisions were slipping past.


Figure you need at least 12 bytes per entry to store the ID and hash. Then you can only handle a number of entries equal to 1/12 of your memory size in bytes (so 1 billion entries if you have 12 billion bytes of RAM).

Much more memory efficient to use a Bloom filter. Build n different ways of hashing the hash (this can be as simple as taking some standard hash function H() and making the ith way H(h, i)). Then when you add an entry you write a bit at each of these locations.

Instead of writing right away, you check to see if all the bits were already set. If they were, the entry being added is a candidate for a possible collision with some earlier entry, so stream it out to a file or something.

Once you have the (much smaller) list of a few million collision candidates, you do another iteration over the whole data set, checking each one against only the collision candidates.

The most memory-intensive part is the Bloom filter. You could probably get away with a Bloom filter size of 10 bits per entry or less. (If you want specific numbers, there are lots of online calculators to help you calibrate the size of a Bloom filter.)
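A sketch of that two-pass scheme in Python, with the H(h, i) family built from a keyed hash as suggested above (sizes here are toy values, not calibrated):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int, num_hashes: int):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: int):
        # Derive k bit positions by hashing the item with an index i: H(h, i).
        for i in range(self.k):
            d = hashlib.blake2b(f"{item}:{i}".encode(), digest_size=8).digest()
            yield int.from_bytes(d, "big") % self.size

    def add_and_check(self, item: int) -> bool:
        """Set the item's bits; return True if all were already set,
        i.e. the item is a *candidate* for colliding with an earlier one."""
        already = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                already = False
                self.bits[byte] |= 1 << bit
        return already

# First pass: stream hashes through the filter, keeping only candidates.
bf = BloomFilter(size_bits=10 * 20_000, num_hashes=7)  # ~10 bits per entry
candidates = [h for h in range(20_000) if bf.add_and_check(h)]
# Second pass (not shown): re-scan the full data set against `candidates`.
```

Since all 20,000 toy "hashes" here are distinct, every candidate is a false positive; at ~10 bits per entry the candidate list stays around 1% of the input.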

But given this is a one-off search, and you had a machine with enough RAM to find a collision, an ordinary hash-based set implementation is good enough and saves the most precious resource of all, developer time :)


This is clever. At the time we needed this machine anyway (this is way before "Cloud computing" was really a thing) but a few years earlier it would have been a good choice.

We actually had most of the Bloom filter code you'd need because we used a lot of Bloom filters in that project. One of the very common queries that system had was "Hey, do we know about random thing X?", for which the answer would often (but not always) be "No". So Bloom filters enabled the system to say definitively "Nope, I know nothing about X" a good fraction of the time without any further work. This actually hurt us in some third-party benchmarks we ran for boasting, but helped in our real use case.


I embarrassed myself as a Noogler trying to explain what in my head seemed like a good tradeoff between hash size (also of URIs) and collision handling. The process had fixed storage for a table of known hashes, and every collision was tested for validity before taking action. If you could use a truncated hash, you’d obviously increase your collisions, but if you’re going to validate all collisions anyway, it seemed like you could increase your coverage substantially (in the 60-bit case it would have been 4x) while only marginally increasing your overhead of collision handling.

There's some kind of sparsity/entropy function involved with optimizing these two but I'm too clueless to sort it out. Without the vocabulary to make my case I just got weirded out by a bunch of smart people staring at me quizzically. Sooo..


That kind of thing happens to me all the time. I'm very often unable to think of or explain things in the moment.

Sometimes it helps to write a doc afterwards to get my thoughts in order and decide if it's worth sharing.


It's known as the birthday paradox. There is a probability table here: https://en.wikipedia.org/wiki/Birthday_paradox#Probability_t...
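The standard approximation P(collision) ≈ 1 − exp(−n²/2d) makes it easy to sanity-check hash sizes without the table:

```python
import math

def collision_probability(n: float, bits: int) -> float:
    """Approximate P(at least one collision) among n items in a 2^bits space."""
    d = 2.0 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2 * d))

p60 = collision_probability(1e9, 60)   # ~0.35: a billion URIs at 60 bits is risky
p96 = collision_probability(1e9, 96)   # ~6e-12: 96 bits is comfortably safe
```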


Yes! That’s the part that I was having trouble getting my head around. Looks like 96 bits would have been fine and 128 more than adequate. A small bit of vindication :)


Couldn’t you also test the recovery logic with a hasher that yields collisions more often? It’s like taking your system, making it worse (using very dumb hashing), and seeing how it degrades and/or breaks?


My preference is always to actually test parts in as close as possible to the real state, avoiding mocks unless they're necessary. The more "fake" the system is, the less it tells you about how your software actually works.

So since I could generate collisions for the real hash, that's what I did. If it had been infeasible, I would of course have found a different way to test.

This particular software was written mostly in C (today I would not write it in C) which makes it even more likely that introducing never-used-in-production behaviour unexpectedly perturbs the system. This same system for example ran into an obscure (and since fixed) libc allocator bug, and a defect in the Linux kernel, because we were really kicking the shit out of the memory mapped file I/O with what we were doing.


“Test as close as possible to the real state” makes sense as a rule of thumb, but I’m having trouble coming up with a reason in this specific case how mocking the hash function could go wrong. Assuming your testing/mocking library is reliable, this should be testing precisely what you intend to test. I guess it’s possible that your “more real” test relies on some hidden behavior, like how your real hash function utilizes system resources, which could affect production, but I don’t know that tests which test things you’re not even aware of are a great thing to advocate for.


The point of the principle is that you don't know how the mocking can go wrong. Assuming your testing/mocking library is reliable gives you one more thing that can unexpectedly go wrong.

Of course, sometimes getting verisimilar test cases is difficult enough that you judge it not worth the effort over mocking it up and dealing with any issues with the mockup, foreseen and unforeseen.


I think you generally need to either assume or establish that your testing environment tends to be sound, just like you need to assume or establish that your programming language/compiler/computing environment tends to be sound. Your application code and test code itself probably shouldn't be compensating for or attempting to avoid known or unknown flaws in your testing environment.


How do you establish that the testing environment is sound besides having defensive tests? It doesn't necessarily have to be the case that all your tests are like that.


Why not just use a hash function that special-cases two test URLs (or a whole class of test URLs) and produces collisions for them? You can use that same hash function in production.


That’s a stupid principle. If your interface isn’t well-defined enough to let you switch out the hashing implementation, your code is too tightly coupled.

A test handling a hash collision is not a test of the hashing algorithm, so you should be able to switch out the algorithm for something super predictable. This isn’t a discussion about mocking frameworks - you should be able to do this by just plugging in a different hash implementation when you construct the test.


These types of blanket rules are useful to ensure a 100% accurate test case. If an error in production costs me $15 Million I don't really care if the engineer wastes a few days to follow a dumb rule.

Also I can employ a $75k - $100k/year engineer and say follow this protocol for testing. If I want someone who can optimize test case 100% of the time with zero mistakes ever then I might have to pay $150k - $200k / year.


That’s not useful. A swapped out hashing algorithm would have been proving the same thing and would have allowed you to test important things like 3+ collisions and two sets of collisions.

Anyone who demanded the real hashing algorithm in this case doesn’t understand how hash collisions work and what problems you need to solve when using hashing. Your “100% accurate test case” serves very little value.

It’s the same as requiring a real person perform the checkout workflow in your shopping cart integration tests. It’s “100% accurate” but it’s a massive waste of resources and provides terrible coverage of the many other edge cases.


It also means that if someone changes the hashing algorithm, your test for the collision mitigation no longer tests it, even though the two should be unrelated.


After going through all that effort to make the test less "fake", I expect tialaramex verified the collision still happened as part of the test.


Yes, actually we also verified that all the hashes were exactly what we expected, because we were using this for content based addressing. "The hash has unexpectedly changed" was a fatal build error.

If we wanted to change how the hash is calculated all extant instances of the system need to dump to backups, then be painstakingly restored after updating. Possible, but certainly not something you'd be doing for giggles.


The "only high fidelity" mantra in testing is a bit misguided. If you only test your system in as close to a production state as possible, you are never going to encounter stress states that would only very rarely appear in production. Seeing how your components independently deal with less-than-ideal neighbours in the system can tell you a lot about how they might handle rare but bad cases in production, and allows you to over-engineer them a bit to avoid that.

> which makes it even more likely that introducing never-used-in-production behaviour unexpectedly perturbs the system.

Isn’t that a good thing? Isn’t the goal to deal with the unexpected in the first case?


Tests come with a non-zero cost. They're code that needs to be maintained. Often complex code, with poor documentation. I don't think the value provided by testing states that you will never actually encounter in production outweighs the cost of maintaining the tests.


This approach works well until you do encounter a situation like that in production :). It's worth testing conservatively because our understanding of our systems is imperfect and because assumptions that may be valid now are likely to change in the future. It's especially true for code that might operate at scale or has security considerations, because both of those can magnify the effects of conditions that occur with arbitrarily low probabilities.

Personally, I see tests as not just a way to prevent bugs, but as a way to learn about the system being tested—both how it behaves now and, continually, how it behaves in future iterations. In this specific case, understanding what happens if two URIs hash to the same value even if they don't "really" do that is an important bit of information: it tells us not only what would happen with a real hash collision but also how our system would behave if there were a bug either in the hashing algorithm itself or in the code that connects the hashing algorithm to the component that uses the hashes.

What happens if we introduce a caching layer between where the hash is generated and where it's used, and there's a cache invalidation bug? I don't know how realistic that is in this specific hypothetical, but I'm sure there are other performance optimizations with the same bug potential that I'm not even thinking of. Changes like that can lead to absolute debugging nightmares! Testing even "impossible" edge cases helps catch this in a systematic way, without needing full knowledge of which conditions might matter.

Of course, none of this talks to the trade-off you pointed out: test code does have a real cost. That is absolutely something to consider. I just don't think the answer is to draw a hard line at "don't test situations that can't come up in production"—and, perhaps naively, I think the "real" solution is less about deciding what is and isn't worth testing and more about reducing the costs of testing more. Write application code in a way that's easy to test and treat your test code as code—keep it clean, include comments, and invest time in setup and tooling to make your tests as easy to write and maintain as possible.


To achieve “defense in depth,” testing components in degenerate non-production situations (and then over engineering them to gracefully deal with those situations) is the only way to get there. Of course, we aren’t building bridges, so maybe it’s less important.


That's not really a "defense in depth" situation.

Defense in depth is regarding layering different types of security, not targeted testing.

https://en.wikipedia.org/wiki/Defense_in_depth_(computing)


It's an analogy. Multiple layers of defense against bugs.


> The "only high fidelity" mantra in testing is a bit misguided.

That’s understating it. It’s incompetence because it’s an excuse to not test a huge surface area of edge cases.


> My preference is always to actually test parts in as close as possible to the real state, avoiding mocks unless they're necessary. The more "fake" the system is, the less it tells you about how your software actually works.

I've heard this before, but never actually had that problem with fakes in practice. Where can I read more about it?


Have a production outage. See the code in question is 100% tested. Scratch head a bit. Vow to fake less next time.


Supposedly the software problems with the Starliner spacecraft were also caused by extensive unit testing under test harnesses without integration testing, as a visible example that reached the news.


Or, even simpler, just stub out your hash function with something that always generates the same value and test with that.


Seems like that would make it less of an integration test


This!

Take over a project with thousands of integration tests.

I trust none of them. Why? Because anything hard to test has been mocked to oblivion.

I have API endpoints that have massive, complicated integration tests. I eventually just got the source for every project in the company and checked to see what they actually did with the API.


In this case, I think mocking is the right thing to do. You want to know that you can handle the case where two hashes collide. But for this purpose, you don't care at all how the hashes were generated.

If the website changes its URI hash algorithm, the "honest" test that just supplies two URIs that happened to collide under the earlier algorithm will suddenly become worthless. The mocked test will continue doing exactly what you always wanted it to do -- exercise the "what if the hashes collide?" case in your code.


You still have to reason through “what if it collides with the real hash function?” How do you know it will behave the same if you don’t test it?

(I can sympathize with both approaches and don’t think either one is better.)


I have the inverse - I started at a place that has hundreds of integration tests that require a database, k8s, the whole shebang. No mocking anywhere.

Tests take an hour to run, are extremely fragile and flakey. Getting your local dev machine set up to run even a subset of the tests is a half day ordeal, and it can change periodically.

If you have external dependencies that don't version their APIs well it may be inevitable that you need to run tests against them. But for developer productivity you should have a fast/non-flakey path to run local unit tests with no network dependencies. It's also great for working on a plane.


> Because anything hard to test has been mocked to oblivion.

Agree. Not an integration test. If it has mocks in it, it is not testing the INTEGRATION of/between components.

Mocks are for Unit tests. Where you want to isolate testing one single UNIT from all the other crap it integrates with (which have their own unit tests) After all these unit tests run, you test the whole shebang together with integration tests.


they mention this was a unit test, for which stubbing the hashing function would have been totally fine


They said it was for unit tests.


I was wondering why GP couldn't have swapped out the hash function for one that just returns a fixed value, making everything a collision. Then it would just be a matter of verifying the system still works under that condition (ignoring the performance hit).

You could even measure the performance hit from that degraded condition and create a heuristic to warn you if something is wrong with the fast path.
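For example, if the hash function is an injected dependency, a test can swap in a degenerate one. This store and its API are hypothetical, just to illustrate the pattern:

```python
import hashlib

def default_hash(uri: str) -> int:
    """Production hash (illustrative): a 64-bit digest of the URI."""
    return int.from_bytes(
        hashlib.blake2b(uri.encode(), digest_size=8).digest(), "big")

class UriStore:
    """Toy store keyed by hash; colliding URIs share a per-hash bucket list."""
    def __init__(self, hash_fn=default_hash):
        self.hash_fn = hash_fn
        self.buckets = {}

    def add(self, uri: str):
        self.buckets.setdefault(self.hash_fn(uri), []).append(uri)

    def contains(self, uri: str) -> bool:
        return uri in self.buckets.get(self.hash_fn(uri), [])

# In a unit test, inject a degenerate hash so *every* URI collides,
# exercising the slow path constantly:
store = UriStore(hash_fn=lambda uri: 0)
store.add("https://a.example/")
store.add("https://b.example/")
assert store.contains("https://a.example/")
assert store.contains("https://b.example/")
```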


Any reason the hash function and the collision handling can’t be separated and mocked for unit tests? What if you decide to increase the bits of hash values or change the hash algorithm?


This is one of those cases where it would have been easier to fake the hashing function in the unit tests with something that collides when you want. What happens if you want a test case where 3+ URLs hash to the same bucket?


I have a question about how you generated collisions.

If each seed is a 32-bit int, and each hash is 60 bits, then you'll have to store 92 bits for each (seed, hash) pair.

With 2^32 (~4.3 billion) (seed, hash) pairs, you'd need 2^32 * 92 bits ~= 50 gigabytes.

But you mention only needing 8-12GB. Is my calculation wrong, or have I missed something?


Since the hash is 60 bits, a collision is expected after 2^30 tries due to the birthday paradox. So by your math this would be 12.3GB. Although this is ignoring hash table overhead, which I expect to be at least 2x. Maybe they just got lucky.


I think I found a way to do it with ~8GB. As you mentioned you only need 2^30 tries due to birthday paradox.

But, instead of using a hash table to store (seed, hash) pairs, we can just use a list of only the hashes. To generate the next hash we use the current hash, i.e. next_hash = hash("example.com/" + string(int64(cur_hash))). So now we just store 60-bit hash values, which takes up 2^30 * 60 bits ≈ 8 GB.

The only other problem is detecting if a hash already exists, which I think can be done using a space efficient probabilistic hashset like a bloom filter (and when the bloom filter says the hash already exists, we do a linear scan to find where it exists and we have a possible collision).


Actually, using that method, you'd need much less than 8GB, since a collision would mean you'd hit a cycle, so you could store something like 1 value for every 2^10 iterations. When you found a collision with one of those, you could examine the interval more carefully to find exactly where the collision was.
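Taken to its limit, this chain-until-you-cycle idea needs no table at all: Floyd's cycle-finding turns the collision search into O(1) memory. A sketch with a 24-bit stand-in for the 60-bit hash (the URI scheme is the toy one from upthread):

```python
import hashlib

BITS = 24  # stand-in for the 60-bit hash so the demo finishes in seconds

def h(x: int) -> int:
    """Truncated hash of a synthetic URI minted from the previous hash value."""
    uri = f"https://foo.example/{x}".encode()
    d = hashlib.blake2b(uri, digest_size=8).digest()
    return int.from_bytes(d, "big") & ((1 << BITS) - 1)

def rho_collision(start: int):
    """Floyd's tortoise and hare: find x != y with h(x) == h(y), O(1) memory."""
    tortoise, hare = h(start), h(h(start))
    while tortoise != hare:                    # phase 1: meet inside the cycle
        tortoise, hare = h(tortoise), h(h(hare))
    tortoise = start                           # phase 2: stop one step before the
    while h(tortoise) != h(hare):              # cycle entrance; the two distinct
        tortoise, hare = h(tortoise), h(hare)  # predecessors are the colliding pair
    return tortoise, hare

# Retry from another start in the (rare) case the start itself lies on the cycle.
for s in range(8):
    x, y = rho_collision(s)
    if x != y:
        break
assert x != y and h(x) == h(y)
```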


Ahh yes, that's very clever. Thanks for sharing :)


This is essentially the idea behind rainbow tables, which let you make a configurable time/space tradeoff when trying to find a hash collision vs a fixed known dictionary/space.


Hellman (yes as in Diffie-Hellman) invented this idea (chaining a reduction function and a hash together over and over) for the improved time/space tradeoff. Hellman's idea actually predates the terrible Microsoft cryptosystems which are famously vulnerable to it, because too many software engineers don't realise Computer Science and thus Mathematics are applicable to their work whether they like it or not.

Oechslin improved on Hellman's idea to make "Rainbow Tables" by slightly perturbing the reduction function with the depth, so that collisions are fleeting unless they also coincidentally happen at the same position on a hash chain. This makes his Rainbow Table far more efficient/reliable for the same pre-computation effort, and it produces the "rainbow" picture in his slide deck.

For my purpose Oechslin's improvement was irrelevant of course because I actually wanted collisions.

But memory fades, it is possible either that I got lucky, that I actually had 16GB of RAM, or that I made some minor efficiency improvements that were enough to make it work.


Thanks for the info. I didn't know the history of that at all. And yeah, the head in the sand effect of many devs in relation to crypto is frustrating.


Interesting, that's similar to rainbow tables.


You’re looking for a collision, any of them will do, so your computation doesn’t actually need to go through all the elements. https://en.m.wikipedia.org/wiki/Birthday_attack


You can also truncate the hash to 30 bits and use it to address a bucket; 12 GB then gives you about 3x32 bits per bucket in which to store your candidates. Then compare the full hash against the candidates.


Why not just run the test-cases using a 16-bit hash function?


Likely a C programmer with little experience doing testing with fakes. Most C projects I’ve worked on are horrible at structuring the code in a way that makes it sane to do dependency injection or mocking.


Did you truncate the hash for dataset-size considerations? If you associate data with the hash, I would guess the test data would dwarf the hash size, so why not use 128- or 256-bit hashes?


The hash is for content-based addressing. We're going to fasten those 60 truncated bits to a 4 bit tag that says this is a URI in the content-based storage. And then we can use this 8 byte value to refer to that URI twice, or fifty thousand times in our system even if the URI is a hundred bytes long - the only problem is collisions, so we need to be sure we notice if there is a collision and handle that correctly everywhere.
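A sketch of that packing — the tag value and hash function here are made up; only the 4+60-bit layout follows the description above:

```python
import hashlib

URI_TAG = 0x9  # hypothetical 4-bit tag meaning "URI in content-based storage"

def uri_ref(uri: str) -> int:
    """Pack a 4-bit tag and a 60-bit truncated hash into one 64-bit reference."""
    h = int.from_bytes(
        hashlib.blake2b(uri.encode(), digest_size=8).digest(), "big")
    return (URI_TAG << 60) | (h & ((1 << 60) - 1))

# An 8-byte value stands in for an arbitrarily long URI, however often it recurs.
ref = uri_ref("https://example.com/some/long/path")
assert ref >> 60 == URI_TAG
assert ref.bit_length() <= 64
```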


Things have gotten even faster since then.

* Target: 64-bit space (2^64).

* GPUs are more flexible and faster. A Vega64 (a few years old now) has 4096 shaders ("CudaThreads") at 1.6 GHz. It can search the entire 32-bit floating point (or integer) space in less than a second. 1.6 GHz x 4096 shaders == 2^42.5 bitspace per second.

* People exhaustively searching a space are relatively patient: and willing to wait months for the results. 60 seconds x 60 minutes x 24 hours x 30 days == 2^21.3

1.6 GHz x 4096 shaders x 60 seconds x 60 minutes x 24 hours x 30 days is roughly equal to 2^63.8. We've done it: a month of GPU effort is all you need to exhaustively search the 64-bit space
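The arithmetic checks out, assuming (optimistically) one candidate tested per shader per clock:

```python
import math

shader_ops_per_sec = 1.6e9 * 4096              # Vega64: 4096 shaders at 1.6 GHz
seconds_per_month = 60 * 60 * 24 * 30

log2_per_sec = math.log2(shader_ops_per_sec)                        # ~42.6
log2_per_month = math.log2(shader_ops_per_sec * seconds_per_month)  # ~63.9
```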

-------

We've entered a world (~3 years ago), where it is reasonable to brute force the 64-bit space, so long as you're willing to wait about a month for the results.

If you're not willing to wait a month, then use 2 GPUs, or 4 GPUs, or a cluster of GPUs like the DGX A100 (or multiple clusters).


Note that the article is specifically talking about testing accelerated code. This is usually using machine-specific features, meaning it won't run on a graphic card.

I am also not sure how fast this is if you need to transfer data out of the GPU, or to add conditional branching to the shader code (it used to be that branches were super slow on GPUs; not sure where they are now).

I could see it working for some other problems though.


> meaning it won't run on a graphic card.

GPUs implement the ceil and round functions.

Heck, GPUs implement popcount, bit-reverse, and a whole slew of instructions that CPUs don't have. (Mainly bitreverse and bpermute. Intel x86 only has permute (aka pshufb), but not the backwards version. BSwap can't emulate bitreverse.)

> (it used to be that branches were super slow in GPUs, not sure where they are now).

Branches are fine. Its divergent branches (in both CPUs and GPUs) that are slow. A uniform branch will be branch-predicted on a CPU, while a uniform branch on a GPU will have all 64-threads (AMD GCN) or 32-threads (NVidia or RDNA) follow the instruction pointer.

There's no slowdown with uniform branches, either on CPUs or GPUs. Both CPUs and GPUs do poorly on divergent branches, but GPUs do MUCH worse than CPUs.


The article was about testing a specific implementation of floor, ceil, and round using CPU instructions. You can run floor, ceil, and round on the GPU, but that won't tell you whether your new implementation using CPU instructions is correct.


I thought the post was about how the 32-bit space could be exhausted in minutes with common hardware in the year 2014. Which means that any problem I have which is in the 32-bit space is a candidate for brute force.

Given the advances in computers over the past few years: exhausting the 64-bit (or 2x 32-bit) space is "reasonable" for the year 2020. At least, reasonable for people who are willing to wait a month (on one GPU), or willing to buy a cluster of GPUs.


No, the article is that if you have some code that operates on a 32 bit value, it's reasonable on most modern hardware to try all 2^32 values to ensure quality.

There may be some GPU code that is reasonable to test on all 2^64 values, but it doesn't help us verify the implementation of SSE (Intel architecture) ceil/round/etc against reference implementations at all, because trying all double precision values would not complete in a practical amount of time and offloading it to a GPU means we are not testing the SSE implementation that we're trying to verify anymore.


> trying all double precision values would not complete in a practical amount of time

Actually we're getting close-ish: AVX512(=8x64) at 4GHz (or two dependency chains at 2GHz) times 64 CPU cores (which TFA's author has mentioned having) is 2 teraops. Double-precision space of 2^64 / 2Tops is 3 months/instruction on a single computer. So you could brute-force verify a simple function in a year or two, and at least the SIMD width and number of cores is likely to continue increasing.


Keep in mind you need to run the non-SIMD baseline function, too, to perform the actual comparison/verification... So, not yet.

Also, AVX512 throttles down, so you're not going to be running it on all cores at full tilt at 4GHz today.

I think it's much more reasonable for 64 bit space to have a reasonable set of human-chosen test cases + a large randomized test sequence.


> I thought the post was about how the 32-bit space could be exhausted in minutes with common hardware in the year 2014. Which means that any problem I have which is in the 32-bit space is a candidate for brute force.

No, the post is about exhaustive testing, and that it's completely reasonable on 32 bit values because it only takes a few minutes.

> Given the advances in computers over the past few years: exhausting the 64-bit (or 2x 32-bit) space is "reasonable" for the year 2020.

That's not actually reasonable, it's feasible for very specific contexts.

> At least, reasonable for people who are willing to wait a month (on one GPU), or willing to buy a cluster of GPUs.

And whose code specifically can run on GPU. TFA is about CPU-optimised code, the entire point was to exhaustively test an SSE3 implementations. How do you do that on a GPU?


> That's not actually reasonable, it's feasible for very specific contexts.

I was inventing a new hash function for myself, just for giggles a few weeks ago.

I needed a constant: I started by choosing an arbitrary constant (I started with 0x31415926...), but then I realized that the 32-bit space, and even 50+ bit spaces were feasible on my GPU.

I ended up rewriting the random-number generator on the GPU, and searched for the number which caused the largest number of "avalanche" bit-flips across the entire 32-bit seed space of the RNG.

Where an "avalanche" bit flip is 16 bits flipped, and 16-bits remaining the same, after the RNG was applied to the seed.

For example: seed = 1 * K, where K == 0xAAAAAAAA will cause 16-bit flips, and 16-bits to remain the same.

Repeat for all 32-bit values of seed, searching for the optimal K. 0xAAAAAAAA wasn't the best number (aka: 1010101010101010 binary), but you can see where my thought process was.

My RNG was more complicated than just a single multiply, but I think you can see where the general thought process was with my above example.
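A scaled-down sketch of that search — the mixer here is a toy stand-in for the real RNG, and both search spaces are tiny; the full 32-bit-by-32-bit sweep is exactly what the GPU was for:

```python
def mix(seed: int, k: int) -> int:
    """Toy one-round mixer standing in for the real RNG: multiply, then xor-shift."""
    x = (seed * k) & 0xFFFFFFFF
    return x ^ (x >> 16)

def avalanche_score(k: int, seeds: range) -> int:
    """Count seeds for which exactly 16 of the 32 output bits differ from the input."""
    return sum(1 for s in seeds if bin(mix(s, k) ^ s).count("1") == 16)

# Search odd constants only (a constraint like the one mentioned above),
# over a reduced seed space.
seeds = range(1 << 12)
best_k = max(range(1, 1 << 8, 2), key=lambda k: avalanche_score(k, seeds))
```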

--------

There was a time when constants were arbitrarily chosen for these kinds of functions. But today, we can literally test ALL numbers and just pick the best.

If K is constrained by some other requirements (in my case: it must be odd, and a few other tidbits to work with my RNG), then you shrink the search space from 64-bits down to something that can be accomplished in just a few days of search.

-------------

As far as I'm concerned, the blog post is talking about how modern computers have conquered the 32-bit space.

My current post is about how the 64-bit space can feasibly be explored. It wasn't even that long ago that a 64-bit space was considered cryptographically secure (see the DES crypto algorithm, or WEP).


The post was about exhausting the space to test code. The example they used of code that should have been tested this way was CPU code. If you're testing CPU code, exhausting the space quickly on a GPU isn't helpful.


> Note that the article is specifically talking about testing accelerated code. This is usually using machine-specific features, meaning it won't run on a graphic card.

I think there are a lot of computing algorithms you might run on a GPU where you might want to verify the correctness over the range of 32-bit or 64-bit numbers (integers or floating point).


You could also use this to verify 32-bit functions with two arguments.


Sounds like a small and relatively cheap farm could do it in hours or minutes even.


Hmmm... assuming 4-GPUs per node, and 1200W per node, you'll need 216kW to generate the result in a day.

But that also assumes that the operation you're testing is 1-clock tick, which is only true for the most trivial of calculations. Still, 216kWs per day per "test-case clock ticks" shows us that the 2^64 exhaustive bitspace is within a reasonable organization's grasp.


I think your units are wrong. You want 216kW-days, not 216kW per day. 216 kW-days is about 18.7 GJ.


Hmm, "216kW days per test-case clock ticks" does sound more correct from a unit perspective.


I love this.

This is another in the "Now that we have computers with gazillions of bits of RAM and many cores and gazillions of cycles, what conventional wisdom is no longer wise?" series :-).

I suggested to a client that they use Sqlite3 for their company database rather than a disk based database. And they looked at me shocked, as if I had spoken heresy. I pointed out how many records they could store on a 64GB machine and that is a small data center class machine and it sort of boggled their mind. Add read-only copies, shared from a NAS device gave them perfect scalability in terms of queries per second (their sales folks had their own database that they were updating). It was fun to watch their expression go from horror to wonder to excitement when they finally implemented it.


Yup! I'm currently running a b2c solution for a client that's grossing ~$3 million/year using Python + sqlite3 behind nginx.

I work for another shop that scolds me for even suggesting using something else than k8s. How can a persistent file system possibly scale? Lol. Ironically they're grossing -$3 million/year.


I am curious about your case: is the traffic mostly read-heavy? Hah, wish you could share more. Interesting stuff.


> I suggested to a client that they use Sqlite3 for their company database rather than a disk based database.

Sqlite is a disk based database...


Well, maybe not "disk based" per se. It can be stored on disk, yeah. But it can also run in memory. I think in this case it's clear that ChuckMcM meant a machine with 64GB of RAM (not disk space), therefore they are talking about having SQLite running in memory, not on disk.


Even if a DB is disk-based, you can throw it in a ramdisk to get the same speedups, often. I would be curious to know if there are any drawbacks or dangers from doing that, apart from the obvious “if your box powers off, the ramdisk is gone” and “each time the box restarts, you have to copy the db to the ramdisk.”

The main problem I was uneasy about was, I would simply use the equivalent of rsync -Pa /ramdisk /disk in a loop, which seemed to ensure safety in the event of a poweroff. It never broke, but I wouldn’t say it was a good idea, necessarily.

Sure speeds up most things though. The benefits are pretty crazy for I/O bound programs.


>I would simply use the equivalent of rsync -Pa /ramdisk /disk in a loop

That's probably a bad idea. rsync isn't going to copy over writes in the same order they were written by the DB. Most DBs assume that when there's some crash that at least the disk would have been in a logically consistent state with respect to the order of certain writes separated by e.g. fsync and barriers.


Does anyone know the risk of radiation bit flips in this case? I don't think it's an issue with ecc ram, but if it's not that could be a problem for certain applications.


A friend of mine wrote a 'chaos engineering' tester for this specific issue: http://github.com/enferex/cme

Randomly inject bit flips into a target process memory space, so you can see/test how your software handles it (if at all).


Exactly! I was wondering about that. It seems like it's so rare as not to matter, but that's no guarantee it won't matter.


Bit flip is always a risk even if you don't use ram disk.


'The most common way to force an SQLite database to exist purely in memory is to open the database using the special filename ":memory:".'
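For reference, a minimal in-memory SQLite sketch using Python's stdlib sqlite3 module (the table and values here are made up for illustration):

```python
import sqlite3

# ":memory:" is the special filename that keeps the database entirely in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, cents INTEGER)")
conn.executemany("INSERT INTO sales (cents) VALUES (?)", [(1999,), (500,)])
total = conn.execute("SELECT SUM(cents) FROM sales").fetchone()[0]
print(total)  # 2499
conn.close()
```

(Storing money as integer cents in the example also sidesteps the floating-point rounding issues this article is about.)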


https://yourdatafitsinram.net/

The IBM 53-qubit quantum compute result from a ton of hard drives + compute power was also intriguing (https://www.ibm.com/blogs/research/2019/10/on-quantum-suprem...).

128PiB of disk space was needed for a 54-qubit computation on their system (hard drives). Fun to see where the supercomputers are still needed, but smaller qubit counts can be emulated on normal computers. 40-qubit calculations could probably be emulated on a 16TB array of hard drives in RAID0.


The trade-off isn’t really size in a lot of cases though. When stuff is stored in RAM you lose the persistence story.


Disk storage for the WAL and periodic checkpoints can get you sub-second recovery targets easily.

Ninja edit: or a UPS and a shutdown procedure triggered by it that writes the database to disk. You can write 100GB in a couple minutes over a 10Gbps network. Faster to a local SSD.


Sub-second recovery and UPS triggered shutdowns are still nowhere near the reliability of a database ensuring fsync has been called when you call “commit”.

There are just certain classes of data (e.g. any kind of business transaction) where you can’t afford this type of best effort durability.


Previous submission 2017:

https://news.ycombinator.com/item?id=15252426

Can't believe this was already 3 years ago. Interesting read!

Previous submission 2014: https://news.ycombinator.com/item?id=7135261


Yes, I originally came across this on HN, likely in 2017 based on my surrounding bookmarks. It has stuck with me ever since even though I don't do much programming anymore. At the time, I was a TA in a MATLAB course, so I tried to emphasize that the code runs on a machine and is fundamentally discrete. (.1 + .2 == .3) returning false is an easy way into this concept.
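That example is easy to demonstrate in any Python REPL; the stdlib decimal module can even display the exact doubles involved:

```python
from decimal import Decimal

assert (0.1 + 0.2 == 0.3) is False
print(0.1 + 0.2)  # 0.30000000000000004
# Decimal(x) shows the exact binary double behind a literal. The doubles
# nearest to 0.1 and 0.2 both sit slightly above those values, so their
# sum lands just above the double nearest to 0.3.
print(Decimal(0.1 + 0.2))
print(Decimal(0.3))
```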

32 bits (and now 64, as @dragontamer calculated) is a huge input space. You could exhaustively verify functions taking four int8 arguments, or two char[2] arguments.


Articles like this make me wonder how everything even works if basic math functions can be so wrong and how a simple concept like floor can have different definitions/ implementations


It often doesn't to high degrees of precision and reliability. It works within some tolerance range and those tolerance ranges are managed or acceptable levels of error (often unnoticed).

I recently helped move/port one fairly complex numeric model from a language and environment to another language and environment and there was an assumption made by upper management that this was simple and exact deterministic results would be easily had.

I warned that this could be an involved process because of simple variations, lack of equivalent mappings between environments and languages, some of which pointed out in this article.

Management were quite surprised at the level of effort needed to dig down through layers of abstraction and identify all these sorts of variations in numeric libraries beneath the consistent high-level abstractions they were familiar with, and why those high-level abstractions simply didn't exist in the other language and environment. They were additionally surprised at the level of effort needed to get satisfactory error tolerance ranges between the two models, since we had to recreate, identically in the target language and environment, high-level abstractions from the source that simply didn't exist there prior to the work.

They didn't realize the variation until we performed a detailed analysis on results and found very small but statistically distinguishable results from the two versions. We ended up identifying a few dozen small assumptions they made about their high level abstractions from the source environment that were false and caused the variation.


Six Stages of Debugging

1. That can’t happen.

2. That doesn’t happen on my machine.

3. That shouldn’t happen.

4. Why does that happen?

5. Oh, I see.

6. How did that ever work?

http://plasmasturm.org/log/6debug/


The DNS haiku;

    It's not DNS.
    There's no way it's DNS.
    It was DNS.
https://www.cyberciti.biz/humour/a-haiku-about-dns/


Someone correct me if I'm wrong because it's been years since my EE class.

Basic math ends at division. Heck if you learn how math works on computers, it feels like a number trick. Those are reliable.

The issue comes when you need to track never ending decimal points. Even if every transistor was dedicated to your problem, you'd still need to round. So even if you have 32 bits of decimals, there will be rounding sometimes.

The human has to decide what to do. Instead of converting the binary to decimal and calling it done, it's more important to be consistent and predictable.

The issues described in the article aren't 5/2 = 2 billion. The article talks about rounding always giving you even numbers, which you can imagine leads to issues if you are using both floats and logic that involves even and odd numbers.

Edit- can someone correct me or add details, I'm very interested but I'm not taking ENG 240 again... :p


Exactly this. The floor function works on binary numbers but has to round them to a decimal place. It's not simple, it just appears to be.


Floor has nothing to do with decimal.


The period in "3.14" is called a "decimal." The digits after the decimal in "3.14" are enumerated as "decimal places." What each decimal place "means" depends on the numeric base.

So, "decimal places" or "rounding to a decimal place" has nothing to do with base 10 (which is what I think you are implying): decimal and "decimal places" exist in all bases, including base 2, and the meaning of each "decimal place" depends on the numeric base (base 10, base 2, etc.).


Sometimes people use "binary point" to distinguish from decimal point and avoid the confusion of the dec- prefix.


"Radix point" or "radix place" are the generic terms I've seen most often.


"Fraction point" and "fraction (or fractional) places" might be a decent alternative that clearly works in any base, while also conveying its purpose.


This is why the big rational data type is so useful on occasion. When you just want to get one calculation exactly right, and performance doesn't matter so much because you aren't doing huge numbers of repeated operations (as in ML), use a big rational with its arbitrary precision numerator and denominator and you don't need to worry about even epsilon error.


> because you aren't doing huge numbers of repeated operations (as in ML)

In ML precision doesn't count so much. It's often possible that we train on 32bits and deploy the model on 16bits. In fact stochasticity is useful and we add it in by dropout or other things. You can drop any connection (or millions of them) in a neural net and get almost the same result. There have been papers that showed that reducing the number of bits used to represent weights might even improve the network because it has a regularising effect. The real brain is also stochastic.


Except if you run exp, log, sqrt, sin, cos etc. Then you immediately leave the rationals.


Yes, it definitely depends on what you're doing with the numbers, but personally I don't recall ever having needed trigonometric functions in my decade plus working life doing standard line of business type stuff.


I can't tell if you are dismissing this, but as an FYI trig is used in Engineering.


I said "it definitely depends on what you're doing with the numbers". That's not a dismissal; I'm simply pointing out that neither I nor many if not most other software engineers use trig on a frequent basis. I'm well aware that it's used in actual engineering, but that's just not what most of us here do.


The answer is that the bugs the article claims are corner cases, most of which will not matter in most practical uses of floating-point maths... and some of which will matter, due to not emulating a weird IEEE floating-point rule that rounds numbers of the form N.5 to the nearest even number instead of up, like everyone is taught in school.

So "wrong" as used in the article simply means "not the same as the IEEE behavior", not "wrong" as in "wildly inaccurate". It matters if you want 100% exact emulation to get identical results. It doesn't if you don't and just want speed.


> some of which will matter, due to not emulating a weird IEEE floating point rule that rounds numbers of the form N.5 to the even number instead of up like everyone is taught in school.

Round-to-even (also known as banker's rounding, because it's used in finance too) isn't weird at all. It minimizes rounding error, especially on repeated calculations. The rounding rule that everyone learns in elementary school is bad and results in systematically biased computations. We can and should do better than that flawed approach. This post explains it in more depth: https://mathematica.stackexchange.com/a/2120
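Python's built-in round() happens to follow the same ties-to-even rule, which makes the bias argument easy to check numerically:

```python
# Ties alternate direction under round-half-to-even:
print([round(n + 0.5) for n in range(6)])  # [0, 2, 2, 4, 4, 6]

# Over many ties, ties-to-even has no net bias, while the schoolbook
# always-round-half-up rule drifts systematically upward.
vals = [n + 0.5 for n in range(1000)]       # 0.5, 1.5, ..., 999.5
true_sum = sum(vals)                        # 500000.0
even_sum = sum(round(v) for v in vals)      # ties-to-even
up_sum = sum(int(v) + 1 for v in vals)      # round-half-up for exact .5 ties
print(even_sum - true_sum)  # 0.0
print(up_sum - true_sum)    # 500.0
```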


>> The answer is that the bugs the article claims are corners cases, most of which will not matter in most practical uses of floating point maths...

Most bugs are corner cases of one sort or another. At the application level you can decide how important the corners are. When you're creating a math library that many others are going to use, and there is a standard, I think the expectations should be higher.


I recently saw a talk (internal only unfortunately) about how some "minor" changes in sin/cos/whatever functions in Chrome broke multiple web sites. It was things like mapping sites and the "minor" changes in the output of sin led to huge errors in the maps.

Which is to say, that minor errors often don't matter, but if you are offering up a general purpose solution then those minor errors probably matter to _somebody_.


It seems to me a bug is socially constructed based on whose assumption(s) are defined as valid. It emerges out of an implicit power struggle among the entities whose interaction reveals the "bug".

To know whose bug it was in your example, you have to make a value judgment about whether the changes were minor and justified, not just a technical evaluation.


I'd say that for most bugs there is a fairly clear right/wrong rather than a social construction.

This bug contains both, I think. The correct result for round(5.5) is arbitrary. The reason that 6.0 is the correct answer is "merely" because that is what the IEEE standard says. That standard could have been different (round-to-nearest odd), but there is value in not lightly ignoring an existing standard.

Other cases are more clear cut. ceil(1e-10) should return 1. This seems non-controversial since ceil is defined as rounding up. Any number greater than zero and less than or equal to one should return one. And yet, some of the algorithms would return zero. That was probably the most serious of the errors.

How serious the errors are depends on how the functions are used, but I think they were all promoted as general-purpose-replacements so they could have ended up being used anywhere.


You should see how confusing modulo becomes if the method isn't explicitly specified:

https://en.wikipedia.org/wiki/Modulo_operation
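Python alone exhibits two of the conventions on that page: % implements floored division (the result takes the sign of the divisor), while math.fmod truncates like C's % operator (the result takes the sign of the dividend):

```python
import math

assert -7 % 3 == 2               # floored: sign follows the divisor
assert math.fmod(-7, 3) == -1.0  # truncated: sign follows the dividend
assert 7 % -3 == -2
assert math.fmod(7, -3) == 1.0
```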


We did this in XLA. It runs as part of CI, which I always found pretty amazing. https://github.com/tensorflow/tensorflow/blob/bd2608178bca1b...

Found a lot of bugs this way.


It is all very nice, but if your function takes two floating-point parameters you have 2^64 ≈ 1.8e19 tests to do...


I have worked at two companies where we brute-force checked binary floating point operators. It’s annoying, error prone, boring, and soooooo satisfying. The Holy Grail^TM is brute force F32-FMA. That one really requires a lot of thought, and can’t be reasonably brute-forced on anyone’s budget.


Totally doable. That's just a few hours using 100k cores.


Worth it if your function is acutely fundamental to the operation of your software because your software runs everyone else's software. You buy some time on a compute cluster with 1024 cores and 4TB of ram, and off you go, after making sure you understand your own code and reduce that number back down to "no it isn't, because two arguments in a given space for fundamental maths operations, taken together, almost certainly don't require combinatorial testing" =)


So it only needs 4 billion times 90 s; is this like never in a million years? (Actually it should be only around 12,000 years, so you could use a cluster of 48,000 computers and be done in 3 months.)


Distribute it over every smartphone on the planet (about 4 billion), and the computations will be done in minutes (of course, that’s dependent on the function being tested).

There will be orchestration overhead and transmission latency, but I guess one could get all results in one place in a day, if one had the ability to do the “run it on every smartphone” thing.


Then you can prove your algorithm correct using proof libraries like Z3.


I decided to build a scatter plot to demonstrate what I was 95% sure was a data loss bug (preproduction, thankfully) in the initializer for a CSPRNG.

I figured plotting a few bytes of output (256x256) would do nicely. I plotted all possible two byte inputs just to make sure it was working, and that ran instantly, so I tried 3 bytes and that was still pretty fast, so I did 4 while I was at lunch or in a meeting.

Sure enough, big black splotches from dropping some of the seed data on the floor. Please to be fixing.

Old farts know a lot of things have been tried before and how those worked out. But we also think doing a complex operation 4 billion times is going to be irretrievably slow. Always test your assumptions.


> For very small numbers (less than about FLT_EPSILON * 0.25) adding 0.5 gives 0.5 exactly, and this then rounds to zero. Since about 40% of the positive floating-point numbers are smaller than FLT_EPSILON * 0.25 this results in a lot of errors – over 850 million of them!

I thought Float Epsilon was the smallest possible number which could be represented by a float? This statement doesn’t seem to make sense...


When I took numerical analysis, the concept was described as "machine epsilon": the smallest number that, when added to 1.0, yields a result different from 1.0.

The utility of this metric is that it tells you the best possible relative error, as you can never guarantee a result will be more correct to the true result (in the domain of mathematical real numbers as opposed to machine floating-point numbers) than half of machine epsilon.

This concept is often extended into discussions of "units in the last place" (ULPs), which is the magnitude of the value of last bit of the mantissa. So ulp(1.0) = ulp(1.5) = machine epsilon, ulp(2.0) = ulp(3.9) = 2 * machine epsilon, etc. A good floating-point library will document its accuracy in terms of ulps, so that we might say that the maximum error of the exp function is 1 ULP and that of pow might be 6 ULPs.
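For doubles, Python 3.9+ exposes exactly this via math.ulp, so the relationships above can be checked directly (sys.float_info.epsilon is the double-precision machine epsilon):

```python
import math
import sys

assert math.ulp(1.0) == sys.float_info.epsilon  # 2**-52 for doubles
assert math.ulp(1.5) == math.ulp(1.0)           # same binade as 1.0
assert math.ulp(2.0) == 2 * math.ulp(1.0)       # ulp doubles at each binade
assert math.ulp(3.9) == math.ulp(2.0)
```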


There are different "epsilon" constants in different languages that mean different things. In C and C++, FLT_EPSILON is "Difference between 1 and the least value greater than 1 that is representable." (http://www.cplusplus.com/reference/cfloat/). You might be thinking of the C# constant.


I thought that too, but it's actually the difference between 1.0f and the next larger number; so it only represents the smallest difference for the non-dotzero range.

See here: http://www.cplusplus.com/reference/cfloat/


FLT_EPSILON is a bound on the relative rounding error for certain operations. Plenty of numbers smaller than it can be represented in most floating-point implementations.


FLT_EPSILON is the difference between 1.0f and the next float. With that we can work out the math:

- If you add 1.0f to FLT_EPSILON then you get an exact result.

- If you add 1.0f to FLT_EPSILON * 0.5 then you get a number that cannot be represented, and it will round to either 1.0 or 1.0 + FLT_EPSILON.

- Similarly, if you add 0.5f to FLT_EPSILON * 0.25 then you get a number that cannot be represented, and it will round to either 0.5 or 0.5 + FLT_EPSILON * 0.5.

With the default rounding mode that last calculation rounds to 0.5. There are a _lot_ of floats below FLT_EPSILON.
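This is straightforward to check. Python only has doubles, but rounding through struct simulates single precision; the f32 helper below is made up for illustration:

```python
import math
import struct

def f32(x):
    """Round a Python double through IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

FLT_EPSILON = 2.0 ** -23   # gap between 1.0f and the next float
tiny = FLT_EPSILON * 0.25  # 2**-25: representable, far above FLT_MIN
assert f32(tiny) == tiny > 0.0

# Half an ulp of 0.5: a tie, and ties-to-even keeps the even mantissa, 0.5.
assert f32(0.5 + tiny) == 0.5

# So a ceil built as "add 0.5, round to nearest" collapses tiny positive
# inputs to 0, while the correct answer is 1.
assert math.ceil(tiny) == 1
```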


I was reading about an implementation of Barrett reduction recently:

https://en.wikipedia.org/wiki/Barrett_reduction

Specifically one that uses 80-bit long double for calculating `(__int128_t)a * b % M` from this github:

https://github.com/kth-competitive-programming/kactl/blob/ma...

But how do you test the correctness of such code? You can't just stress-test all possible products of 64-bit numbers!

In the end you still gotta go back to good old-fashioned proofs: https://github.com/kth-competitive-programming/kactl/blob/ma...


We've certainly tested all Float32 inputs to some functions in Julia's math packages. I'm unable to find the relevant issues this very moment, otherwise would certainly post them.


I suspect https://github.com/JuliaLang/julia/pull/29700 might be one of them. As the original post states, rounding is (a) amenable to hardware acceleration, (b) easy to get wrong or slow, and (c) possible to test exhaustively.


Maybe I don't understand, but is this article only suggesting exhaustive testing for single-precision, single-input functions? Obviously this approach is not practical as soon as there are multiple inputs or wider precision.


Yes, the article says "Exhaustive testing works brilliantly for functions that take a single float as input. I used this to great effect when rewriting all of the CRT math functions for a game console some years ago. On the other hand, if you have a function that takes multiple floats or a double as input then the search space is too big. In that case a mixture of test cases for suspected problem areas and random testing should work."

I think the point is that exhaustive testing is feasible in more situations than you might at first think, and than floating-point arithmetic is a particularly fertile source of edge conditions and corner cases, so it's a good place to go for exhaustive testing where you can.


fertile indeed!!!! haha

I should learn to read


In many cases it is practical but people wrongly assume it isn't. In cases where it's not practical, exhausting a subset can still be very powerful.

E.g. for a function that takes two floats (x,y) test every value of x input with y={0,-0,1,nan,inf,-1,x,-x,2x,x/2,FLT_EPSILON,FLT_MIN,FLT_MAX,random,random,random...}, and then repeat with x,y swapped.

A little knowledge of the domain can help pick better special values, but there is a trade-off in introducing the same blind-spots that permitted bugs in the first place. Still: testing just one argument on all values because that's all you can afford is still a lot better than only testing a couple special values for both arguments.

Exhaustive testing, like randomized testing catches errors you weren't even thinking about-- but unlike random testing exhaustion can reliably catch bugs which are extremely rare. "Partial exhaustion" can have some of the same benefits.
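A rough sketch of that strategy (the function names and the special-value list here are illustrative, not a real test suite):

```python
import itertools
import math

# Special values worth pairing with every value of the exhausted argument.
SPECIALS = [0.0, -0.0, 1.0, -1.0, float("nan"), float("inf"), float("-inf"),
            2.0 ** -23, 2.0 ** -126, 3.4e38]

def same_result(got, want):
    # NaN never compares equal to itself, so treat NaN == NaN as a match.
    return got == want or (math.isnan(got) and math.isnan(want))

def partial_exhaust(fast, reference, xs):
    """Exhaust one argument (all 2**32 floats in the real thing; `xs` here)
    while the other cycles through special values, then swap the roles."""
    for x, y in itertools.product(xs, SPECIALS):
        assert same_result(fast(x, y), reference(x, y)), (x, y)
        assert same_result(fast(y, x), reference(y, x)), (y, x)

# Trivial demo: an implementation tested against itself always passes.
# Substitute your fast implementation and a trusted reference.
partial_exhaust(min, min, [i / 16.0 for i in range(-64, 64)])
```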


Mmm, disagree based on my experience. Oh, also: I think people on this thread may be misusing the concept of float epsilon. Keep in mind that it is not the smallest value of a float.


Testing all of the doubles may actually be possible [1]. Even if that estimate is optimistic, testing all doubles for order $10k would be feasible for some problems.

[1] https://news.ycombinator.com/item?id=18109432


Not to mention that exhaustive testing only works when you already know the correct result for each input.


In addition to many correctness issues, that code is relatively slow.

SSE4.1 introduced roundps instruction that's both fast and correct. Unlike AVX2 or FMA3, SSE4.1 support is almost universal by now. In most of my current projects, I check support on startup and refuse to run if SSE4.1 ain't supported. Just don't forget to set the corresponding macros for DirectXMath, Eigen, etc.


"When in doubt, use brute force" -- Ken Thompson


My god this issue has been around for too long. In the '80s testing folk learned to shotgun the argument space to test functions, and they've not gone an inch forward since! Or so it seems.

A version of the Pentium was released that couldn't divide by 10. And even back then, testing all the mantissas on a simulator would have taken only hours. Instead, they masked, marketed and released a tragically flawed processor.

Just test the input space exhaustively, folks! It's a computer; it feels no pain. Work it hard until it fails or works perfectly.


On most CPUs, there are plenty of instructions with two 64-bit operands, and with SIMD instructions, plenty of CPUs take more than 1K bits in total as input to a single instruction. Good luck exhaustively testing those.


In the early 1980s, truncated floating point calculations caused the Vancouver Stock Exchange index value to drift down for about 23 months, until it was shown as less than half the correct value: https://en.wikipedia.org/wiki/Vancouver_Stock_Exchange#Round...


I was once in the position to test 3M possible inputs - that took only a few minutes. It's a good strategy to keep in mind.


Does anyone have code to reduce the search space by a few orders of magnitude, to test functions with a float64 argument, or two or three float32 arguments, in a few seconds? Perhaps by skipping "boring" floats far from endpoints, which are unlikely to produce the errors that more interesting floats do?


You can't naively do it with the bits of a float, but for most purposes N-wise testing will get you a lot of coverage relative to random or manual cases for little expense. I've had success with it using a library to generate cases testing math kernels with many parameters.

https://en.m.wikipedia.org/wiki/All-pairs_testing


Yes, I've done this using libFuzzer. Supply custom mutation and crossover functions that generates interesting floating point numbers. Then let the coverage guided fuzzer exercise your code. https://rigtorp.se/fuzzing-floating-point-code/


That's essentially what fuzz/property-based testing libraries do. You tell the library what types of arguments you want, write your test, and it puts in some known corner cases and a few random values that change each time you run the test (but in the better libraries, values that cause a failure are saved as new corner cases).


Well, if the ceil function failed on 1.0f, maybe even the ordinary numbers are worth testing.

But... would randomness help? Take a representative sample of like 1/64th of the double space and you might run into most of the common problems.


all floats are equally boring.


No, -0.f is more interesting than 4.1045857301e23f. It's more interesting because it's more likely to fail a test where others don't.


That's a bold assumption: let's make sure it's correct by testing against all floats, instead of only the ones you consider interesting, in the process making any float as boring as any other float again.


Javascript guy here, can you help me set up that Beowulf cluster so I can start testing my 2^64 cases?


I recall hackers building standard password hashes of all eight-character ASCII combinations, then creating a reverse lookup table. But that's ~400 trillion entries for the standard 68-character ASCII password subset.

400 trillion hashes is not a big computation in the blockchain era.



"Round to nearest even is the default IEEE rounding mode. This means that 5.5 rounds to 6, and 6.5 also rounds to 6"

Does anyone know the reason behind this? why the nearest even and not a hard cut like "always round up with >= 0.5"?


Makes sense when you're refactoring an existing function with one numeric argument, and which evaluates rapidly.


Crap now I have 67 million failures to fix.


> I’ve written in great detail about how to compare floating-point values using an epsilon, but there are times when it is just not appropriate. Sometimes there really is an answer that is correct, and in those cases anything less than perfection is just sloppy.

Certain applications will choose fixed point to avoid what appears to be the imprecisions of floating point.


For Python you cannot use the stdlib for this, but it would work for numpy float32.


1. Given the thoroughness of the write-up, I'm going to assume that the author did not find that any of the discrepancies in the implementations were actually bugs in the reference implementation. He goes as far as to say that he reimplemented the reference implementation, so this seems like a safe assumption. Still, it would be nice to confirm. Given that other implementations seem so shoddy, it wouldn't be surprising to find a bug in the reference.

2. "Conventional wisdom Nazis" Maybe the author was just bored of writing at this point, but some justification for why bitwise comparing floats was a correct decision would have been nice. Is it because there's no math involving unrepresentable intermediates? Because there's no integration happening, so no way for errors to accumulate?


"Sometimes there really is an answer that is correct, and in those cases anything less than perfection is just sloppy."

Anytime there is a correct answer, and it can be determined exactly (+, -, *, /, sqrt) you should expect nothing less than perfection. ceil, round, and floor fall into this category fairly self-evidently, I think.


Test them all and waste a huge amount of energy and computational resources!

I hope people test like that for really critical stuff, but not for your everyday library for your personal startup or e-business. Not even the big players need this.


"Seymour Cray didn't care that 81.0/3.0 did not give exactly 27.0 on the CDC 6000 class machines; and he was universally respected for making the fastest machines around." ──Linus Torvalds


"Several of the ceil functions were implemented by adding 0.5 to the input value and rounding to nearest."

What sort of clown would make that mistake and yet think they are capable of writing a floating point library?


Can you give an example of when this wouldn't work? I'm definitely not qualified to write a float library (lol) but I also can't think of any cases where that wouldn't work?


ceil(5) = round(5.5) = 6

ceil(5) should be 5


To expand on this, if you have an integer and you add 0.5 then a round() function has to make a choice, because it is halfway between integers. The IEEE standard says that rounding in general (not just for the round function) should round-to-nearest-even. That means that rounding should go to the closest result, and if there is a tie then it should go to the even number.

So, if ceil(x) is implemented as round(x+0.5) then:

    ceil(4) = round(4.5) = 4
    ceil(5) = round(5.5) = 6

The first of those is correct. The second is incorrect.
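The failure is easy to reproduce in Python, whose round() also ties to even (buggy_ceil is a made-up name for the flawed approach):

```python
import math

def buggy_ceil(x):
    # The flawed implementation: add 0.5 and round to nearest (ties-to-even).
    return round(x + 0.5)

assert buggy_ceil(4.0) == 4 == math.ceil(4.0)  # happens to be right
assert buggy_ceil(5.0) == 6                    # wrong: ceil(5) is 5
tiny = 2.0 ** -54                              # half an ulp of 0.5 in doubles
assert buggy_ceil(tiny) == 0                   # wrong: ceil of a tiny positive is 1
assert math.ceil(tiny) == 1
```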


This is probably about numeric precision and losing digits, when you add something with digits only at the beginning, like 0.5, to something small like 0.1000000000000000000000000000013124142579275927597245205 or whatever a barely representable floating-point number is. Then the result becomes imprecise and rounding to nearest will give the wrong value.

Such effects and similar should be well known to anyone taking on the responsibility to develop a floating point library.


Every odd integer? The original article discusses it in great detail.


Very difficult, due to rounding errors.


How do you do this for a function with more than one or two inputs? If the domain is 4B to the power five, you are out of luck. Very few of us need to implement "floor" or any other function that takes 1 float input.



