Fast Inverse Square Root (quenta.org)
190 points by bertzzie on Sept 15, 2012 | 29 comments


In case anyone is interested in reading other accounts and the HN discussions that accompany them, here are a few previous submissions of other articles:

http://news.ycombinator.com/item?id=213056 <- Some comments

http://news.ycombinator.com/item?id=419166 <- Some comments

http://news.ycombinator.com/item?id=573912

http://news.ycombinator.com/item?id=896092 <- Some comments

http://news.ycombinator.com/item?id=1599635

http://news.ycombinator.com/item?id=2332793 <- Some comments

http://news.ycombinator.com/item?id=3115168 <- Some comments

http://news.ycombinator.com/item?id=3259199


I remember re-deriving this a few years ago. Short answer: if x = 2^n then rsqrt(x) = 2^(-n/2), so the exponent just gets negated and divided by 2. If x = (1 + m) 2^n where 0 <= m < 1 is the fractional part of the mantissa, then rsqrt(1 + m) = 1 + rsqrt'(1) m + O(m^2) = 1 - m/2 + O(m^2) is the first-order Taylor expansion around 1. Thus rsqrt(x) ~= (1 - m/2) 2^(-n/2), so both m and n get negated and divided by 2.

This immediately suggests getting an initial guess for Newton's method by just shifting the bitwise form of an IEEE-754 floating point number by 1 and then doing some fix-ups for the negation.

The only hitch is that the exponent is stored in biased form rather than 2's complement form, so you have to first subtract out the bias, do the shift, and add back in the bias. Also, shifting can cause a least-significant 1 bit from the exponent to shift down into the most significant bit of the mantissa. But that turns out to be useful since it amounts to using the approximation sqrt(2) ~= 1.5. But doing all these bitwise operations is a lot of extra work compared to just doing a single shift. It turns out you can get almost the same effect with just a single addition of an immediate operand. Most of the bits of this addend are already predetermined by the need to handle the biasing and unbiasing in the exponent, and the remainder can easily be determined by brute force (the search space is only of size 2^24 or so) by just minimizing the approximation error over some test set. Doing that, I was able to converge to almost exactly the same 'magic number' as in Carmack's method.
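For anyone curious, here's a minimal C sketch of what that brute-force step might look like. The scan window, step size, test set, and error metric here are arbitrary choices for illustration, not the ones from the actual search:

  #include <math.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Initial guess: shift the bit pattern right by 1 and subtract it
     from a candidate magic constant. */
  static float guess(float x, uint32_t magic) {
      uint32_t i;
      memcpy(&i, &x, sizeof i);
      i = magic - (i >> 1);
      memcpy(&x, &i, sizeof x);
      return x;
  }

  int main(void) {
      uint32_t best = 0;
      double best_err = INFINITY;
      /* Coarse scan (step 0x1000 just to keep the runtime short); a
         real search would cover the low ~24 bits exhaustively. */
      for (uint32_t m = 0x5f000000u; m < 0x60000000u; m += 0x1000u) {
          double worst = 0.0;
          for (float x = 0.25f; x < 4.0f; x += 0.01f) {
              /* guess(x)*sqrt(x) is exactly 1 for a perfect rsqrt */
              double e = fabs((double)guess(x, m) * sqrt((double)x) - 1.0);
              if (e > worst) worst = e;
          }
          if (worst < best_err) { best_err = worst; best = m; }
      }
      printf("best magic: 0x%08x (max rel err %.4f)\n", (unsigned)best, best_err);
      return 0;
  }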

My coworker Charles Bloom has a series of posts on this and related matters: http://cbloomrants.blogspot.com/2010/11/11-20-10-function-ap...

Here's an old email exchange I had with another coworker last time I was thinking about this stuff, where I tried to derive as many bits of the constant as possible by reasoning: https://gist.github.com/4e1d0dff97193fe5d745

Historical aside: How this algorithm came to be associated with Carmack is an amusing and roundabout tale. My partner in crime on Telemetry at RAD is Brian Hook. Hook was one of the first employees at 3dfx, where he designed and implemented the GLIDE graphics API. Gary Tarolli was the co-founder and CTO at 3dfx, and Hook got the rsqrt code from him. When Hook left 3dfx to work with Carmack on Quake 2 and 3 at id software, he brought that code with him and copied it into the codebase. Carmack never had anything to do with it. It's my recollection Tarolli didn't develop the algorithm either--he had gotten it from someone else in turn.


The more-or-less definitive history of the "Carmack" fast inverse sqrt: http://www.beyond3d.com/content/articles/8/


RAD is a company I think would be pretty cool to work at. How's business treating you guys? Any cool projects right now you can talk about?


I've always liked the post provided by BetterExplained[0], but this article is more thorough while remaining simple. I'd probably still use the BetterExplained post for a first introduction, I think.

[0]:http://betterexplained.com/articles/understanding-quakes-fas...


This line:

> x = *(float*)&i;

breaks strict aliasing and has undefined behavior. If you compile it with g++ -O2, you may get unexpected results. The solution is to use a union, or to compile with -fno-strict-aliasing.

I used to have a lot of code like this, which worked fine on older compilers, but not on newer compilers. It's perfectly reasonable code, so I'm not sure why the newer compilers don't produce the expected behavior by default.
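For reference, a minimal sketch of the union version (the function name is mine, not from the article):

  #include <stdint.h>

  static float rsqrt_guess(float x) {
      union { float f; uint32_t i; } u;
      u.f = x;                        /* store the bits via the float member */
      u.i = 0x5f3759df - (u.i >> 1);  /* manipulate them via the int member */
      return u.f;
  }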


>I'm not sure why the newer compilers don't produce the expected behavior by default.

The reason is that there are optimizations a compiler can do when it knows that two particular pointers will not reference the same location in memory at a given time (two pointers referencing the same location is called aliasing). These optimizations have to do with memory access. If pointers cannot be proven not to alias each other, then the order in which memory is loaded, modified, and stored is very rigid, since it is impossible to tell whether changing the order will change the end result. Sometimes that's just life; you might have some complex interactions going on with your pointers, and it may actually be that the order of operations can't be changed.

But if you can assume that the pointers don't alias each other, then the compiler is free to change the order of operations for better efficiency. It can often remove unnecessary loads and stores (if x and y don't alias, you don't have to reload y just because you changed x), and it can also put loads together at the beginning and stores together at the end (which is faster because, um, caching? I'm not really clear on that part).

The catch is that, in general, there is no way for the compiler to know whether two pointers can alias each other. In C99 you can inform the compiler of this with the restrict keyword, but that requires careful analysis of your code to make sure you aren't lying to the compiler. If you mess up, you may end up with bugs that are very difficult to track down.
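As a hypothetical illustration of what restrict buys you (the function and names here are made up):

  /* With restrict, the compiler may assume dst and src never overlap,
     so it can load *src once, outside the loop. Without restrict, it
     must assume a store to dst[i] might change *src and reload it on
     every iteration. */
  void fill(float *restrict dst, const float *restrict src, int n) {
      for (int i = 0; i < n; i++)
          dst[i] = *src;
  }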

So that's where strict aliasing comes in. Usually, two pointers of different types won't point at the same memory location. So they added the "strict aliasing" rule to make that fact official, because it means that in the vast majority of cases, code gets a nice speedup without any changes or difficult reasoning. Using a union is basically a way of informing the compiler that you need it to make an exception this one time.

I used to use a macro something like this to make the process concise:

  #define PUN(toType, x) (((union {__typeof__(x) __a; toType __b;})(x)).__b)
Then you'd do:

  float xhalf = 0.5f*x;
  int i = PUN(int, x);          // evil floating point bit level hacking
  i = 0x5f3759df - (i >> 1);    // what the fuck?
  x = PUN(float, i);
  x = x*(1.5f - xhalf*x*x);     // one Newton-Raphson iteration


> put loads together at the beginning

Modern CPUs can actually "look ahead" in the execution flow. So for something like this:

  MOV EAX,[memory_access]
  ADD EBX,ECX
  ADD EAX,EBX
The ADD EBX,ECX instruction would be done while the CPU would otherwise be idle, waiting for the memory subsystem to return the result of the first instruction.

Of course there are strange corner cases that have to be taken into account. For example, if the first instruction ends up causing an exception -- maybe the memory it refers to is in the swap file, or maybe it's dereferencing a NULL pointer -- the CPU has to be able to roll back the execution of any subsequent instructions, so it appears to the software as if the exception happened during the instruction that actually caused it.

If you're interested in the automatic program transformations performed by modern CPUs, you can read more by Googling for the Intel processor optimization manual [1] -- warning, PDF link. (I believe AMD publishes a similar document.)

[1] http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...


I actually use macros along the lines of:

  #define LOAD(TYP, ptr) (*((TYP *)memcpy((TYP[1]){ 0 }, ptr, sizeof(TYP))))
  #define REINTERPRET(AS_TYP, FROM_TYP, val) LOAD(AS_TYP, (FROM_TYP[1]){ val })
In practice (at least with GCC, in my experience), the memcpy and single-use temporaries get optimized away entirely. In strict C99, writing one member of a union and then reading a different member of the same union is undefined in the general case, last I checked. http://cellperformance.beyond3d.com/articles/2006/06/underst... seems to agree, but suggests that every major compiler recognizes it as a de facto idiom and supports it anyway.
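Hypothetical usage, mirroring the PUN snippet upthread:

  int i = REINTERPRET(int, float, x);   // read the bits of the float x as an int
  i = 0x5f3759df - (i >> 1);
  x = REINTERPRET(float, int, i);       // and turn the bits back into a float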


Yep, type punning through a union is nonstandard and unportable, but if I'm not mistaken, it is a documented feature in GCC.


Type punning through a union is explicitly defined to work in C99 and C11 (footnote 95 in C11):

> If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’).

Please don't help spread the myth that compiler writers can break this idiom. That said, I use memcpy in my own code.


All right, I see it in C99 as well now (§6.5.2.3 footnote 82, if I'm not mistaken). Thanks for the correction.


Oh, well I stand corrected.


I don't think you can use unions. In the draft of the ISO-IEC 9899 standard (C99) I have, paragraph 6.2.6.1.7 (I'm not kidding) says

> When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values.

So no undefined behaviour (nasal demons and such) but still unspecified.


I suppose this is saying something like this:

  union X { char a; float b; };

If you store something in a, the other 3 bytes corresponding to b take unspecified values.

But I suspect this doesn't apply if a and b have the same size.


This was corrected in C99.TC2 with the addition of footnote 82 (footnote 95 in C11).


Finally an article that really breaks it down for someone like me (college calculus knowledge) to understand. Thanks for sharing this, super interesting.


If anybody wants to further play around with this, I wrote the following script in python using matplotlib:

  import numpy
  import matplotlib.pyplot as plt
  import struct
  
  def power(x, p):
    i = struct.unpack("i", struct.pack("f", x))[0]
    i = int((1 - p) * 0x3f7a3bea + p * i)
    return struct.unpack("f", struct.pack("i", i))[0]
  
  p = -0.5
  xs = numpy.arange(0.1, 100, 0.1)
  ys1 = [x**p for x in xs]
  ys2 = [power(x, p) for x in xs]
  
  plt.plot(xs, ys1, label="Real function")
  plt.plot(xs, ys2, label="Approximation")
  
  plt.title("Graph of x^%g" % p)
  plt.legend(loc="upper left")
  plt.savefig("graph.png")
Change p to whatever you want (even values greater than 1) to see different powers.


A long time ago (the late '80s or early '90s) I needed a square root routine for a floating point number and found an article by Jack Crenshaw (a real rocket scientist) describing a method that always converged in fewer than 5 iterations. First you normalized your FP number (converted the mantissa to 1.xxxxx) while adjusting the exponent, and then multiplied by 1.38, which gave a surprisingly close initial answer.

I'm guessing that (if I could find the article) the constant used in this floating point technique and the one used in this article are somehow inversely related too!

In any case, it's a very nice article ... thanks!


Gee, you've triggered a memory for me too. I remember doing something very similar (1.38 was the trigger) around the same time. I was probably writing floating point routines for processors that didn't have them, perhaps the Z-80.


Not to be too pedantic, but this is the reciprocal of the square root. (The inverse of the square root operation would be the square operation.)


On the other hand, you could parse it as inverse (square root). Inverse of x is 1/x.


There was a nifty article on Flipcode long ago ( http://www.flipcode.com/archives/Fast_Approximate_Distance_F... ) in a similar vein.


I remember seeing the article about the history of this pop up, and when it started getting into the maths my brain wasn't keen. But this is a really simple and great explanation of even the heavier aspects of the math involved.


If only there were an equivalent fast atan2.


I've seen this multiple times in the past few years, and each time I've understood just a tiny bit more. I would love to see an interview with the original author of this.


My impression (from reading the Clang/LLVM posts a while back) was that they tend to assume that anything with undefined behaviour in the spec returns whatever value makes the code faster (e.g. by simply ignoring the statement altogether).

edit: http://blog.llvm.org/2011/05/what-every-c-programmer-should-...
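For instance (my own illustration of the same point): signed integer overflow is undefined, so the compiler may assume x + 1 never wraps and fold this check away:

  /* Since signed overflow "never happens", this whole check can
     legally be compiled to "return 0". */
  int overflows(int x) {
      return x + 1 < x;
  }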


I am wondering whether, in a lot of code, taking this (inverse) square root is even necessary. When calculating length = sqrt(x^2 + y^2), you often just want to compare against a minimum or maximum value, so comparing x^2 + y^2 directly gives you what you want and calculating the sqrt is unnecessary.
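A minimal sketch of that idea (the names are mine): since sqrt is monotonic on non-negative inputs, comparing squared lengths gives the same answer as comparing lengths:

  int within_radius(float x, float y, float r) {
      /* same truth value as sqrt(x*x + y*y) <= r, with no sqrt call */
      return x*x + y*y <= r*r;
  }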


> why?

Also because x * (1/sqrt(x)) == sqrt(x) (so on a Pentium, a sqrt was only 3 cycles more than an inv-sqrt; pretty cheap for a sqrt).
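In code, assuming a fast_rsqrt like the approximation in the article:

  /* fast_rsqrt is assumed to be the inverse-sqrt approximation from
     the article; one extra multiply turns it into a sqrt. */
  float fast_sqrt(float x) {
      return x * fast_rsqrt(x);  /* x * (1/sqrt(x)) == sqrt(x) */
  }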



